Module 2 of 16

Foundations of Search & Retrieval

BM25, TF-IDF, vector search fundamentals, and similarity metrics

3 hours · 2 labs · Free

Start here

Learning objectives

  • Understand information retrieval fundamentals
  • Implement keyword search with BM25 and TF-IDF
  • Understand vector search and similarity metrics
  • Compare keyword vs semantic search tradeoffs

Keyword search vs. semantic search

Keyword search (BM25): matches exact terms

  • Pro: fast and precise for exact matches
  • Pro: great for names, codes, and IDs
  • Con: misses synonyms and meaning
  • Con: "car" does not match "vehicle"

Semantic search (vectors): matches meaning

  • Pro: understands synonyms and context
  • Pro: "car" matches "vehicle"
  • Con: can miss exact terms and acronyms
  • Con: slower, and needs an embedding model

Best approach: hybrid search (Module 7), which combines both.

Before building RAG, you need to understand retrieval. Retrieval is the art of finding the most relevant documents for a given query. Two paradigms dominate: keyword search (exact term matching) and semantic search (meaning-based matching).

Keyword Search: BM25 and TF-IDF

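TF-IDF (Term Frequency-Inverse Document Frequency) scores a document higher when a query term appears often in it and rarely across the rest of the corpus. BM25 refines this idea in two ways: it normalizes for document length, so long documents are not favored just for containing more words, and it saturates term frequency, so the tenth occurrence of a term adds far less than the first. Both match exact terms: they are fast and precise but blind to meaning.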
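To make the TF-IDF idea concrete, here is a minimal from-scratch sketch using only the standard library. The function name `tf_idf_scores` is hypothetical, and the weighting shown (relative term frequency times the natural log of inverse document frequency) is one common variant assumed for illustration; libraries differ in the exact formula.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, tokenized_docs):
    """Score each document as the sum of tf * idf over the query terms."""
    n_docs = len(tokenized_docs)

    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue  # term absent: contributes nothing
            tf = counts[term] / len(doc)             # relative term frequency
            idf = math.log(n_docs / doc_freq[term])  # rarer terms weigh more
            score += tf * idf
        scores.append(score)
    return scores

docs = [d.split() for d in ["python web framework", "django rest api", "flask application"]]
tfidf = tf_idf_scores("python api framework".split(), docs)
# The first document matches two query terms, the second matches one,
# and the third matches none, so the scores decrease in that order.
```

Note that a term appearing in every document gets an IDF of zero under this variant, which is why very common words contribute little to the ranking.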

# BM25 with rank-bm25 library
from rank_bm25 import BM25Okapi

corpus = ["python web framework", "django rest api", "flask application"]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "python api framework"
scores = bm25.get_scores(query.split())
# Returns relevance scores for each document
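To see what a BM25 scorer does internally, the sketch below reimplements the scoring in plain Python on the same corpus. It is an illustration under assumptions, not the rank-bm25 implementation: the function name `bm25_scores` is hypothetical, the defaults `k1=1.5` and `b=0.75` are commonly used values, and the IDF formula is one widely used smoothed variant, so the library's numbers may differ.

```python
import math
from collections import Counter

def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Return one BM25 score per document (illustrative variant)."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs

    # Document frequency: how many documents contain each term.
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Length normalization: documents longer than average are penalized.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            n_t = doc_freq[term]
            # Smoothed IDF (assumed variant); always positive.
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Saturation: this fraction approaches k1 + 1 as tf grows.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

corpus = ["python web framework", "django rest api", "flask application"]
tokenized = [doc.split() for doc in corpus]
scores = bm25_scores("python api framework".split(), tokenized)

# Pair each document with its score and sort best-first.
ranked = sorted(zip(corpus, scores), key=lambda pair: pair[1], reverse=True)
```

Here `ranked` places "python web framework" first because it matches two query terms, while "flask application" matches none and scores zero. Because of saturation, repeating a term five times in a document raises its score by much less than a factor of five.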

Vector Search: Semantic Similarity

Vector search converts text into high-dimensional numeric vectors (embeddings), positioned so that texts with similar meanings end up close together. A query about "automobile maintenance" can match documents about "car repair" even though they share no words.
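The mechanics can be sketched without an embedding model. In the example below, the three-dimensional vectors are hand-made toy values chosen for illustration (a real model would learn hundreds of dimensions from data), and the names `toy_embeddings` and `search` are hypothetical.

```python
import math

# Toy "embeddings": invented numbers, NOT the output of a real model.
toy_embeddings = {
    "car repair":             [0.9, 0.1, 0.0],
    "automobile maintenance": [0.8, 0.2, 0.1],
    "chocolate cake recipe":  [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=2):
    """Brute-force nearest-neighbor search: score every document, keep the best."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index.items()]
    return sorted(scored, reverse=True)[:top_k]

# Index every text except the one used as the query.
documents = {t: v for t, v in toy_embeddings.items() if t != "automobile maintenance"}
results = search(toy_embeddings["automobile maintenance"], documents)
# results[0] is "car repair": its vector points in nearly the same direction
# as the query vector, even though the two texts share no words.
```

Scoring every document on each query is fine for small in-memory collections; larger collections typically rely on approximate nearest-neighbor indexes instead.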

Similarity Metrics

  • Cosine similarity: Measures the angle between vectors and ignores their length. Most common for text.
  • Euclidean distance: Measures straight-line distance. Sensitive to magnitude.
  • Dot product: Fast to compute. Equals cosine similarity when the vectors are normalized to unit length.
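The three metrics above can be compared side by side in a few lines of plain Python. This is a minimal sketch; the helper names are hypothetical, and the two example vectors are chosen so that they point in the same direction but differ in length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the length

cos_uv = cosine_similarity(u, v)            # direction only: 1.0 up to rounding
dist_uv = euclidean_distance(u, v)          # nonzero, because magnitudes differ
dot_unit = dot(normalize(u), normalize(v))  # equals cos_uv after normalizing
```

Cosine similarity treats `u` and `v` as identical while Euclidean distance does not, which is the magnitude sensitivity noted above. Normalizing once at indexing time lets a system use the cheaper dot product and still get cosine rankings.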

Common mistakes

What usually breaks

  • Using only semantic search (misses exact terms, acronyms, product codes)
  • Using only keyword search (misses meaning, synonyms, paraphrases)
  • Not evaluating retrieval quality separately from generation quality
  • Assuming more retrieved documents = better answers (often the opposite)
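Evaluating retrieval on its own usually starts with precision and recall at a cutoff k. The sketch below assumes you have a ranked list of retrieved document IDs and a hand-labeled set of relevant IDs for a query; the function names and the sample IDs are hypothetical.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

retrieved = ["d3", "d7", "d1", "d9", "d4"]  # ranked output of a retriever
relevant = {"d3", "d1"}                     # hand-labeled ground truth

p3 = precision_at_k(retrieved, relevant, 3)  # 2 of the top 3 are relevant
r3 = recall_at_k(retrieved, relevant, 3)     # both relevant documents found
p5 = precision_at_k(retrieved, relevant, 5)  # same 2 hits spread over 5 results
```

Going from k=3 to k=5 adds no relevant documents here, so recall stays the same while precision drops. That extra irrelevant context is one reason retrieving more documents can lead to worse answers.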

Key terms

Vocabulary used in this module

BM25

Best Matching 25: a keyword ranking function based on term frequency, inverse document frequency, and document length normalization

TF-IDF

Term Frequency-Inverse Document Frequency: a weighting scheme that scores a term by how often it appears in a document, discounted by how common it is across the corpus

Embedding

Vector representation of text capturing semantic meaning

Cosine Similarity

Metric measuring the angle between vectors (1 = same direction, 0 = orthogonal and usually read as unrelated, -1 = opposite direction)

Labs

Hands-on labs

25 min · Beginner

Implement Keyword Search with BM25

Build a keyword search engine from scratch.

  1. Load a document corpus
  2. Tokenize and index with BM25
  3. Query and rank results
  4. Observe limitations with synonym queries
View lab on GitHub

30 min · Beginner

Implement Semantic Search

Build vector-based semantic search.

  1. Generate embeddings with sentence-transformers
  2. Store vectors in memory
  3. Query with cosine similarity
  4. Compare results with BM25 on same queries
View lab on GitHub

Recap

Key takeaways

  • BM25 matches exact terms — fast but misses synonyms
  • Vector search matches meaning — finds semantic matches but misses exact terms
  • Cosine similarity is the standard metric for text embeddings
  • Neither approach alone is sufficient — hybrid search combines both (Module 7)
  • Understanding retrieval fundamentals is essential before building RAG
