
Module 2: Foundations of Search & Retrieval

BM25, TF-IDF, vector search fundamentals, and similarity metrics

3 hours. 2 hands-on labs. Free course module.

Learning Objectives

  • Understand information retrieval fundamentals
  • Implement keyword search with BM25 and TF-IDF
  • Understand vector search and similarity metrics
  • Compare keyword vs semantic search tradeoffs

Why This Matters

RAG (retrieval-augmented generation) is only as good as its retrieval. If the retriever returns the wrong documents, the LLM generates answers from irrelevant context. Understanding search fundamentals (keyword vs semantic, precision vs recall) is the foundation of every production RAG system.

Keyword Search vs Semantic Search

Keyword Search (BM25): matches exact terms

  • Pro: fast, precise for exact matches
  • Pro: great for names, codes, IDs
  • Con: misses synonyms and meaning
  • Con: "car" does not match "vehicle"

Semantic Search (Vectors): matches meaning

  • Pro: understands synonyms and context
  • Pro: "car" matches "vehicle"
  • Con: misses exact terms, acronyms
  • Con: slower, needs an embedding model

Best approach: Hybrid Search (Module 7), which combines both.

Figure: Architecture diagram for Module 2: Foundations of Search & Retrieval.

Lesson Content

Before building RAG, you need to understand retrieval. Retrieval is the art of finding the most relevant documents for a given query. Two paradigms dominate: keyword search (exact term matching) and semantic search (meaning-based matching).

Keyword Search: BM25 and TF-IDF

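TF-IDF (Term Frequency-Inverse Document Frequency) scores a document by how often a query term appears in it, weighted by how rare that term is across the whole corpus, so frequent-but-common words count for little. BM25 refines this in two ways: it normalizes for document length, so long documents are not favored just for containing more words, and it saturates term frequency, so each additional occurrence of a term adds less to the score. Both methods match exact terms: they are fast and precise but blind to meaning.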
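To make the TF-IDF weighting concrete, here is a minimal sketch in plain Python. It assumes one common formulation (term frequency as count divided by document length, IDF as log(N / df)); libraries differ in smoothing and normalization details, and the function name `tf_idf_scores` is made up for this illustration.

```python
import math
from collections import Counter

def tf_idf_scores(query, corpus):
    """Score each document against the query with a basic TF-IDF sum."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in counts:
                continue  # exact-match only: unseen terms contribute nothing
            tf = counts[term] / len(doc)        # how prominent the term is here
            idf = math.log(n_docs / df[term])   # how rare the term is overall
            score += tf * idf
        scores.append(score)
    return scores

corpus = ["python web framework", "django rest api", "flask application"]
scores = tf_idf_scores("python api framework", corpus)
```

With this corpus, the first document matches two query terms, the second matches one, and the third matches none, so the scores come out in that order.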

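# BM25 with the rank-bm25 library
from rank_bm25 import BM25Okapi

corpus = ["python web framework", "django rest api", "flask application"]
# Whitespace tokenization; lowercase so matching is case-insensitive
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "python api framework"
query_tokens = query.lower().split()
scores = bm25.get_scores(query_tokens)
# scores holds one relevance score per document, in corpus order

# Fetch the best-matching document text directly
top_docs = bm25.get_top_n(query_tokens, corpus, n=1)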
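What "document length" and "term saturation" mean in the score is easier to see with the formula written out. The sketch below is a from-scratch version of the Okapi BM25 scoring function, assuming the conventional parameters `k1` (saturation) and `b` (length normalization) with commonly used default values, and a non-negative IDF variant. It is an illustration, not necessarily the exact computation rank-bm25 performs.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query with Okapi-style BM25."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in tokenized_corpus for term in set(doc))

    scores = []
    for doc in tokenized_corpus:
        counts = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher weight (assumed non-negative IDF variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized; b controls how much.
            length_norm = 1 - b + b * len(doc) / avg_len
            # tf appears in numerator and denominator, so repeats saturate.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

corpus = ["python web framework", "django rest api", "flask application"]
tokenized = [doc.split() for doc in corpus]
scores = bm25_scores("python api framework".split(), tokenized)
```

Because the term frequency sits in both the numerator and the denominator, a term that appears ten times scores higher than one that appears once, but never more than `k1 + 1` times higher. Setting `b=0` switches length normalization off.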

Vector Search: Semantic Similarity

Vector search converts text into high-dimensional numeric vectors (embeddings), produced by an embedding model so that texts with similar meanings map to nearby vectors. A query about "automobile maintenance" will match documents about "car repair" even though they share no words.
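The idea of "nearby vectors" can be sketched without an embedding model. In the example below the three-dimensional vectors are invented by hand purely for illustration; real embeddings come from a model and typically have hundreds of dimensions. The names `toy_index` and `search` are hypothetical.

```python
import math

# Hypothetical hand-made "embeddings"; a real system would compute these
# with an embedding model.
toy_index = {
    "car repair guide": [0.9, 0.1, 0.0],
    "chocolate cake recipe": [0.0, 0.2, 0.9],
    "bicycle tire replacement": [0.6, 0.6, 0.1],
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vector, index, k=2):
    """Brute-force search: compare the query to every vector, keep the top k."""
    scored = [(text, cosine(query_vector, vec)) for text, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Stand-in for the embedding of the query "automobile maintenance".
query_vector = [0.85, 0.2, 0.05]
results = search(query_vector, toy_index)
```

Here the car repair document ranks first and the bicycle document second, even though the query text shares no words with either. Comparing against every stored vector is fine for small collections; larger ones usually rely on an approximate nearest-neighbor index.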

Similarity Metrics

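  • Cosine similarity: Measures the angle between vectors and ignores their length. Most common for text.
  • Euclidean distance: Measures straight-line distance between vector endpoints. Sensitive to magnitude.
  • Dot product: Fast to compute. Equals cosine similarity when vectors are normalized to unit length; otherwise it also reflects magnitude.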
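The differences between the three metrics show up clearly on two vectors that point in the same direction but differ in length. This is a small sketch in plain Python; the helper names are chosen for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

cos_ab = cosine_similarity(a, b)            # direction only: about 1.0
dist_ab = euclidean_distance(a, b)          # nonzero, because lengths differ
dot_ab = dot(a, b)                          # grows with magnitude: 28.0
dot_unit = dot(normalize(a), normalize(b))  # matches cos_ab after normalizing
```

Cosine similarity treats `a` and `b` as identical, Euclidean distance does not, and the raw dot product only agrees with cosine once both vectors are normalized.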

Common Mistakes

  • Using only semantic search (misses exact terms, acronyms, product codes)
  • Using only keyword search (misses meaning, synonyms, paraphrases)
  • Not evaluating retrieval quality separately from generation quality
  • Assuming more retrieved documents = better answers (often the opposite)
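Evaluating retrieval on its own usually starts with precision and recall at a cutoff k. The following sketch assumes you have a ranked list of retrieved document IDs and a hand-labeled set of relevant IDs for a query; the IDs and function names are made up for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Divides by the number actually returned; some definitions divide by k.
    return hits / len(top_k)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Hypothetical ranked results and relevance labels for one query.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5"}

p3 = precision_at_k(retrieved, relevant, 3)  # 2 of the top 3 are relevant
r3 = recall_at_k(retrieved, relevant, 3)     # 2 of the 3 relevant docs found
p5 = precision_at_k(retrieved, relevant, 5)  # 2 of the top 5 are relevant
r5 = recall_at_k(retrieved, relevant, 5)     # still 2 of 3 found
```

In this example, going from 3 to 5 retrieved documents lowers precision without improving recall: the extra documents only add irrelevant context for the LLM.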

Key Terms

BM25
Best Matching 25: keyword ranking function based on term frequency, inverse document frequency, and document length normalization
TF-IDF
Term Frequency-Inverse Document Frequency — document relevance scoring
Embedding
Vector representation of text capturing semantic meaning
Cosine Similarity
Metric measuring the angle between vectors, ranging from -1 to 1 (1 = same direction, 0 = unrelated, -1 = opposite)

Hands-On Labs

  1. Implement Keyword Search with BM25

    Build a keyword search engine from scratch.

    25 min - Beginner

    • Load a document corpus
    • Tokenize and index with BM25
    • Query and rank results
    • Observe limitations with synonym queries

    View lab files on GitHub

  2. Implement Semantic Search

    Build vector-based semantic search.

    30 min - Beginner

    • Generate embeddings with sentence-transformers
    • Store vectors in memory
    • Query with cosine similarity
    • Compare results with BM25 on same queries

    View lab files on GitHub

Key Takeaways

  • BM25 matches exact terms — fast but misses synonyms
  • Vector search matches meaning — finds semantic matches but misses exact terms
  • Cosine similarity is the standard metric for text embeddings
  • Neither approach alone is sufficient — hybrid search combines both (Module 7)
  • Understanding retrieval fundamentals is essential before building RAG