Module 2: Foundations of Search & Retrieval
BM25, TF-IDF, vector search fundamentals, and similarity metrics
3 hours. 2 hands-on labs. Free course module.
Learning Objectives
- Understand information retrieval fundamentals
- Implement keyword search with BM25 and TF-IDF
- Understand vector search and similarity metrics
- Compare keyword vs semantic search tradeoffs
Why This Matters
RAG is only as good as its retrieval. If you retrieve the wrong documents, the LLM generates answers from irrelevant context. Understanding search fundamentals — keyword vs semantic, precision vs recall — is the foundation of every production RAG system.
Lesson Content
Before building RAG, you need to understand retrieval. Retrieval is the art of finding the most relevant documents for a given query. Two paradigms dominate: keyword search (exact term matching) and semantic search (meaning-based matching).
Keyword Search: BM25 and TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) scores a document by how often a query term appears in it, weighted by how rare that term is across the whole corpus. BM25 refines this idea: it normalizes for document length and saturates term frequency, so repeating a term yields diminishing returns. Both match exact terms, which makes them fast and precise but blind to meaning.
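# BM25 with the rank-bm25 library (pip install rank-bm25)
from rank_bm25 import BM25Okapi

corpus = ["python web framework", "django rest api", "flask application"]
# Whitespace tokenization keeps the example short; real pipelines usually lowercase and strip punctuation too
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "python api framework"
# get_scores returns one relevance score per document, in corpus order
scores = bm25.get_scores(query.split())
for doc, score in zip(corpus, scores):
    print(f"{score:.3f}  {doc}")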
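To make the TF-IDF description above concrete, here is a minimal from-scratch sketch using only the standard library. The function name tf_idf_scores is hypothetical, and the weighting used (relative term frequency times the natural log of N/df) is just one of several common TF-IDF variants, so library implementations may produce different numbers.

```python
import math
from collections import Counter

def tf_idf_scores(corpus, query):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        counts = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in counts:
                continue  # exact matching only: unseen terms add nothing
            tf = counts[term] / len(doc)        # relative term frequency
            idf = math.log(n_docs / df[term])   # rarer terms weigh more
            score += tf * idf
        scores.append(score)
    return scores

corpus = ["python web framework", "django rest api", "flask application"]
scores = tf_idf_scores(corpus, "python api framework")
for doc, score in zip(corpus, scores):
    print(f"{score:.3f}  {doc}")
```

The first document matches two query terms and ranks highest, the second matches one, and the third shares no terms with the query and scores zero.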
Vector Search: Semantic Similarity
Vector search converts text into high-dimensional vectors (embeddings), positioned so that texts with similar meanings lie close together. A query about "automobile maintenance" can match documents about "car repair" even though they share no words.
Similarity Metrics
- Cosine similarity: Measures the cosine of the angle between two vectors and ignores their magnitude. Most common for text.
- Euclidean distance: Measures straight-line distance. Sensitive to magnitude.
- Dot product: Fast to compute. Equals cosine similarity when vectors are normalized to unit length.
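The three metrics above can be compared side by side in a short sketch. The helper names and the two example vectors are made up for illustration; real embeddings would have hundreds of dimensions and would normally be handled with a numerical library rather than plain lists.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

cos_ab = cosine_similarity(a, b)            # direction only, ignores the scaling
dist_ab = euclidean_distance(a, b)          # nonzero because the magnitudes differ
dot_unit = dot(normalize(a), normalize(b))  # matches cosine on unit vectors

print(cos_ab, dist_ab, dot_unit)
```

Cosine similarity treats the two vectors as identical in direction, while Euclidean distance still reports a gap between them. After normalization, the plain dot product gives the same value as cosine similarity, which is why many vector stores normalize once at indexing time and use the dot product afterwards.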
Common Mistakes
- Using only semantic search (misses exact terms, acronyms, product codes)
- Using only keyword search (misses meaning, synonyms, paraphrases)
- Not evaluating retrieval quality separately from generation quality
- Assuming more retrieved documents = better answers (often the opposite)
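One way to evaluate retrieval separately from generation is to score the ranked results against hand-labeled relevance judgments. The sketch below computes precision@k and recall@k; the document ids and the relevant set are hypothetical, and a real evaluation would average these numbers over many queries.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Hypothetical ranked output and hand-labeled relevance judgments.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}

p_at_3 = precision_at_k(retrieved, relevant, 3)
r_at_3 = recall_at_k(retrieved, relevant, 3)
p_at_5 = precision_at_k(retrieved, relevant, 5)
r_at_5 = recall_at_k(retrieved, relevant, 5)

print(p_at_3, r_at_3, p_at_5, r_at_5)
```

In this toy case, going from 3 to 5 retrieved documents adds no new relevant results, so recall stays flat while precision drops. The extra documents are pure noise in the context passed to the LLM.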
Key Terms
- BM25: Best Matching 25, a keyword ranking function based on term frequency, inverse document frequency, and document length
- TF-IDF: Term Frequency-Inverse Document Frequency, a document relevance score
- Embedding: Vector representation of text capturing semantic meaning
- Cosine Similarity: Metric based on the angle between vectors (1 = same direction, 0 = orthogonal, -1 = opposite)
Hands-On Labs
Implement Keyword Search with BM25 (25 min, Beginner)
Build a keyword search engine from scratch.
- Load a document corpus
- Tokenize and index with BM25
- Query and rank results
- Observe limitations with synonym queries
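The steps above could be sketched roughly as follows. TinyBM25 is a hypothetical class written for illustration, not the rank-bm25 implementation. It assumes the common defaults k1=1.5 and b=0.75 and an IDF variant with +1 inside the logarithm (which keeps weights non-negative), so its scores may differ from what a library returns.

```python
import math
from collections import Counter

class TinyBM25:
    """Hypothetical from-scratch BM25 index (not the rank-bm25 implementation)."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1 = k1  # controls term-frequency saturation
        self.b = b    # controls strength of length normalization
        self.docs = [doc.lower().split() for doc in corpus]
        self.counts = [Counter(doc) for doc in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        # Document frequency: number of documents containing each term.
        self.df = Counter(t for doc in self.docs for t in set(doc))

    def idf(self, term):
        n = self.df.get(term, 0)
        total_docs = len(self.docs)
        return math.log((total_docs - n + 0.5) / (n + 0.5) + 1)

    def score(self, query, index):
        doc_len = len(self.docs[index])
        counts = self.counts[index]
        length_norm = 1 - self.b + self.b * doc_len / self.avgdl
        total = 0.0
        for term in query.lower().split():
            f = counts.get(term, 0)
            if f == 0:
                continue  # exact matching: absent terms contribute nothing
            total += self.idf(term) * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def rank(self, query):
        """Return (score, document index) pairs, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)

corpus = ["car repair manual", "python web framework", "garden tools catalog"]
index = TinyBM25(corpus)

exact = index.rank("car repair")                  # shares terms with document 0
synonym = index.rank("automobile maintenance")    # same meaning, no shared terms

print(exact)
print(synonym)
```

The exact-term query ranks the car repair document first, while the synonym query scores every document at zero. That gap is the limitation the last lab step asks you to observe.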
Implement Semantic Search (30 min, Beginner)
Build vector-based semantic search.
- Generate embeddings with sentence-transformers
- Store vectors in memory
- Query with cosine similarity
- Compare results with BM25 on same queries
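The storage and query steps above might look something like this sketch. InMemoryVectorStore is a hypothetical class, and the 3-dimensional vectors are made up to stand in for real embeddings. In the lab, those vectors would come from a sentence-transformers model instead.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class InMemoryVectorStore:
    """Hypothetical brute-force store: keeps (text, vector) pairs in a list."""

    def __init__(self):
        self.items = []

    def add(self, text, vector):
        self.items.append((text, vector))

    def search(self, query_vector, top_k=2):
        """Compare the query against every stored vector and return the best matches."""
        scored = [
            (cosine_similarity(query_vector, vec), text)
            for text, vec in self.items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

# Made-up 3-d vectors standing in for real embeddings.
store = InMemoryVectorStore()
store.add("car repair guide", [0.9, 0.1, 0.0])
store.add("python web framework", [0.0, 0.2, 0.9])
store.add("vehicle service schedule", [0.8, 0.3, 0.1])

# Pretend this is the embedding of the query "automobile maintenance".
results = store.search([0.85, 0.2, 0.05], top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Brute-force comparison against every stored vector is fine for a lab-sized corpus. Larger collections typically rely on an approximate nearest-neighbor index instead.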
Key Takeaways
- BM25 matches exact terms — fast but misses synonyms
- Vector search matches meaning — finds semantic matches but misses exact terms
- Cosine similarity is the most common metric for text embeddings
- Neither approach alone is sufficient — hybrid search combines both (Module 7)
- Understanding retrieval fundamentals is essential before building RAG