Module 2 of 16

Foundations of Search & Retrieval

BM25, TF-IDF, vector search fundamentals, and similarity metrics

3 hours · 2 labs · Free

Start here

Learning objectives

Understand information retrieval fundamentals
Implement keyword search with BM25 and TF-IDF
Understand vector search and similarity metrics
Compare keyword vs semantic search tradeoffs

Before building RAG, you need to understand retrieval. Retrieval is the art of finding the most relevant documents for a given query. Two paradigms dominate: keyword search (exact term matching) and semantic search (meaning-based matching).

Keyword Search: BM25 and TF-IDF

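TF-IDF (Term Frequency-Inverse Document Frequency) scores a document for a term by multiplying how often the term appears in that document (term frequency) by how rare the term is across the whole corpus (inverse document frequency). BM25 refines this in two ways: it normalizes for document length, so long documents are not favored just for containing more words, and it saturates term frequency, so each additional occurrence of a term adds less to the score. Both methods match exact terms, which makes them fast and precise but blind to meaning.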
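To make the TF-IDF idea concrete, here is a minimal pure-Python sketch. It assumes one simple variant (length-normalized term frequency times a plain logarithmic IDF); libraries typically add smoothing and other adjustments, and the helper names `tf_idf` and `score` are illustrative only.

```python
import math
from collections import Counter

corpus = ["python web framework", "django rest api", "flask application"]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Term frequency in this document times a plain log IDF (one simple variant)."""
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / df[term]) if df[term] else 0.0
    return tf * idf

def score(query, doc):
    """Sum the TF-IDF weights of the query terms found in the document."""
    return sum(tf_idf(term, doc) for term in query.split())

query = "python api framework"
scores = [score(query, doc) for doc in docs]
```

The first document matches two query terms, the second matches one, and the third matches none, so the scores fall in that order.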

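# BM25 with the rank-bm25 library
from rank_bm25 import BM25Okapi

corpus = ["python web framework", "django rest api", "flask application"]
# Whitespace tokenization keeps the example short; real pipelines usually
# lowercase and strip punctuation as well.
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "python api framework"
# get_scores returns one relevance score per document, in corpus order
scores = bm25.get_scores(query.split())
# get_top_n returns the n highest-scoring documents themselves
top_docs = bm25.get_top_n(query.split(), corpus, n=2)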
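The two BM25 refinements, term saturation and length normalization, are easier to see in a from-scratch scoring function. The sketch below assumes one common formulation of BM25 with an always-positive IDF and the frequently used parameter values k1 = 1.5 and b = 0.75. The function name `bm25_term_score` and the corpus statistics passed to it are made up for illustration.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    """Score one query term against one document (one common BM25 variant)."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Documents longer than average get a larger penalty in the denominator.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Term saturation: each extra occurrence adds less than the previous one.
s1, s2, s10 = (bm25_term_score(tf, 100, 100, 1000, 10) for tf in (1, 2, 10))
gain_first = s2 - s1
gain_later = (s10 - s2) / 8  # average gain per extra occurrence from 2 to 10

# Length normalization: same term count, longer document, lower score.
short_doc = bm25_term_score(2, 50, 100, 1000, 10)
long_doc = bm25_term_score(2, 400, 100, 1000, 10)
```

Raising k1 slows saturation, and setting b to 0 switches length normalization off entirely.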

Vector Search: Semantic Similarity

Vector search converts text into high-dimensional vectors of numbers (embeddings), produced by a model trained so that texts with similar meanings map to nearby vectors. A query about "automobile maintenance" can match documents about "car repair" even though they share no words.
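A toy sketch of the idea follows. The 3-dimensional vectors are invented by hand to stand in for real embeddings, which would come from an embedding model and have hundreds of dimensions. The search itself is a brute-force comparison of the query vector against every document vector.

```python
import math

# Hand-made 3-dimensional vectors standing in for real embeddings (invented values).
documents = {
    "car repair": [0.9, 0.1, 0.0],
    "chocolate cake recipe": [0.0, 0.2, 0.9],
    "tax filing deadlines": [0.1, 0.9, 0.1],
}
query_text = "automobile maintenance"
query_vec = [0.8, 0.2, 0.1]  # hypothetical embedding for the query

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Brute-force search: score every document, then sort by similarity.
ranked = sorted(documents, key=lambda d: cosine(query_vec, documents[d]), reverse=True)
best_match = ranked[0]
```

Here `best_match` is "car repair" although the query and the document share no words. A keyword scorer would give that pair a score of zero.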

Similarity Metrics

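Cosine similarity: Measures the angle between vectors and ignores their magnitude. The most common choice for text embeddings.
Euclidean distance: Measures straight-line distance between vectors. Sensitive to magnitude; smaller values mean more similar.
Dot product: Fast to compute. Equivalent to cosine similarity when vectors are normalized to unit length; otherwise it also rewards larger magnitudes.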
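The differences between the three metrics show up clearly on two vectors that point in the same direction but differ in length. This is a minimal sketch in plain Python; the helper names are illustrative, and production code would normally use a vectorized library instead.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

cos_ab = cosine_similarity(a, b)            # depends on direction only
dist_ab = euclidean_distance(a, b)          # non-zero because magnitudes differ
dot_unit = dot(normalize(a), normalize(b))  # matches cosine once both are unit length
```

Cosine similarity treats `a` and `b` as a perfect match, while Euclidean distance reports them as clearly apart. This is why many vector stores normalize embeddings once at indexing time and then use the cheaper dot product.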

Common mistakes

What usually breaks

Using only semantic search (misses exact terms, acronyms, product codes)
Using only keyword search (misses meaning, synonyms, paraphrases)
Not evaluating retrieval quality separately from generation quality
Assuming more retrieved documents = better answers (extra documents often add noise and make answers worse)
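Evaluating retrieval on its own needs only a ranked list of retrieved document IDs and a set of IDs labeled as relevant. The sketch below computes two widely used measures, recall@k and reciprocal rank. The document IDs and relevance labels are hypothetical, and the function names are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical retriever output and hand-labeled relevant documents.
retrieved = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d8"}

recall_3 = recall_at_k(retrieved, relevant, k=3)
recall_5 = recall_at_k(retrieved, relevant, k=5)
rr = reciprocal_rank(retrieved, relevant)
```

Averaging these values over a set of labeled queries gives a retrieval score that does not depend on the generation step. Comparing recall at different values of k also shows how much each additional retrieved document actually contributes.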

Key terms

Vocabulary used in this module

BM25

Best Matching 25 - keyword ranking algorithm that combines term frequency, inverse document frequency, and document length normalization

TF-IDF

Term Frequency-Inverse Document Frequency - weights a term by how often it appears in a document and how rare it is across the corpus

Embedding

Vector representation of text capturing semantic meaning

Cosine Similarity

Metric measuring the angle between vectors (1 = same direction, 0 = orthogonal, -1 = opposite)

Labs

Hands-on labs

25 min · Beginner

Implement Keyword Search with BM25

Build a keyword search engine from scratch.

Load a document corpus
Tokenize and index with BM25
Query and rank results
Observe limitations with synonym queries


30 min · Beginner

Implement Semantic Search

Build vector-based semantic search.

Generate embeddings with sentence-transformers
Store vectors in memory
Query with cosine similarity
Compare results with BM25 on same queries


Recap

Key takeaways

BM25 matches exact terms - fast but misses synonyms
Vector search matches meaning - finds synonyms and paraphrases but can miss exact terms such as acronyms and product codes
Cosine similarity is the most common metric for text embeddings
Neither approach alone is sufficient - hybrid search combines both (Module 7)
Understanding retrieval fundamentals is essential before building RAG