Before building RAG, you need to understand retrieval. Retrieval is the art of finding the most relevant documents for a given query. Two paradigms dominate: keyword search (exact term matching) and semantic search (meaning-based matching).
Keyword Search: BM25 and TF-IDF
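TF-IDF (Term Frequency-Inverse Document Frequency) scores a document by how often a query term appears in it, weighted by how rare that term is across the whole corpus. BM25 refines this in two ways: it normalizes for document length, so long documents are not favored simply for containing more words, and it saturates term frequency, so each additional occurrence of a term adds less to the score. Both methods match exact terms, which makes them fast and precise but blind to meaning: a query for "car" will not match a document that only says "automobile".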
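# BM25 with the rank-bm25 library (pip install rank-bm25)
from rank_bm25 import BM25Okapi
# Toy corpus; whitespace splitting stands in for a real tokenizer
corpus = ["python web framework", "django rest api", "flask application"]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)
# Tokenize the query the same way as the corpus
query = "python api framework"
scores = bm25.get_scores(query.split())
# scores holds one BM25 relevance score per document, in corpus order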
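For comparison with the BM25 example, here is a minimal sketch of TF-IDF scoring in plain Python on the same toy corpus. The helper names (idf, tfidf_score) are hypothetical, and the smoothed IDF formula is only one of several common variants, so treat this as an illustration of the idea rather than a reference implementation.

```python
import math
from collections import Counter

corpus = ["python web framework", "django rest api", "flask application"]
docs = [doc.split() for doc in corpus]

def idf(term, docs):
    # Smoothed inverse document frequency (an assumed variant):
    # terms found in fewer documents get a larger weight.
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + df)) + 1

def tfidf_score(query_terms, doc, docs):
    # Sum of (relative term frequency * IDF) over the query terms.
    counts = Counter(doc)
    return sum((counts[t] / len(doc)) * idf(t, docs) for t in query_terms)

query = "python api framework".split()
tfidf_scores = [tfidf_score(query, doc, docs) for doc in docs]

# Index of the highest-scoring document
best = max(range(len(docs)), key=lambda i: tfidf_scores[i])
print(corpus[best])  # prints "python web framework"
```

The first document wins because it contains two of the three query terms, while the third scores zero because it shares none of them. That zero is the "blind to meaning" limitation in action.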
Vector Search: Semantic Similarity
Vector search converts text into high-dimensional numeric vectors (embeddings), produced by a model trained so that texts with similar meanings map to nearby vectors. A query about "automobile maintenance" can therefore match documents about "car repair" even though they share no words.
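The idea can be sketched without an embedding model by using hand-written vectors. Everything in this example is an assumption made for illustration: the three-dimensional vectors, the document titles, and the query vector are invented, whereas a real system would obtain embeddings with hundreds of dimensions from a trained model.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-written for illustration only.
doc_vectors = {
    "car repair guide": [0.9, 0.1, 0.0],
    "chocolate cake recipe": [0.0, 0.2, 0.9],
    "engine oil change tips": [0.8, 0.3, 0.1],
}

# Stands in for the embedding of the query "automobile maintenance".
query_vector = [0.85, 0.15, 0.05]

def cosine(a, b):
    # Cosine of the angle between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda title: cosine(query_vector, doc_vectors[title]),
    reverse=True,
)
print(ranked)
```

With these made-up vectors, the two car-related documents rank ahead of the recipe even though none of the titles contains the word "automobile" or "maintenance".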
Similarity Metrics
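- Cosine similarity: Measures the angle between vectors and ignores their magnitude. The most common choice for text embeddings.
- Euclidean distance: Measures the straight-line distance between vectors. Sensitive to magnitude, so two vectors pointing the same way can still be far apart if their lengths differ.
- Dot product: The cheapest to compute. It equals cosine similarity when vectors are normalized to unit length; otherwise larger magnitudes inflate the score.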
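The three metrics in the list can be compared directly in a few lines of plain Python. This is a minimal sketch: the function names and the two sample vectors are assumptions chosen to make the differences visible, and production code would normally rely on a numerical library or the vector database's built-in metrics.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    # Depends only on direction, not on vector length.
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    # Straight-line distance; grows when magnitudes differ.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    # Scale a vector to unit length.
    n = norm(a)
    return [x / n for x in a]

a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]  # same direction as a, twice the magnitude

print(cosine_similarity(a, b))          # identical direction, so similarity is 1
print(euclidean_distance(a, b))         # nonzero, because the magnitudes differ
print(dot(a, b))                        # raw dot product is inflated by magnitude
print(dot(normalize(a), normalize(b)))  # after normalization, matches cosine
```

The two vectors point in exactly the same direction, so cosine similarity treats them as identical while Euclidean distance does not. Normalizing first makes the dot product agree with cosine similarity.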