Basic RAG uses a single retrieval mode, typically vector similarity search. Production RAG uses hybrid search (BM25 + vectors), reranking (cross-encoder models), and query transformation. These techniques can improve retrieval quality by 20-40%, and better retrieval usually translates into better answers.
Hybrid Search
Combine keyword search (BM25) with vector search, then merge results using Reciprocal Rank Fusion (RRF). BM25 catches exact terms that vector search misses (product codes, acronyms). Vectors catch meaning that BM25 misses (synonyms, paraphrases).
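A minimal sketch of the RRF merge step, assuming each retriever returns an ordered list of document IDs. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The function name, the sample IDs, and the default k=60 (a commonly used constant) are illustrative assumptions, not part of any specific library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant; 60 is a common default (assumption, tune as needed).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers
bm25_hits = ["sku-4411", "faq-12", "guide-3"]
vector_hits = ["guide-3", "faq-12", "blog-9"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Because RRF uses only rank positions, it avoids having to normalize BM25 scores and cosine similarities onto a common scale. Documents found by both retrievers rise above documents found by only one.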
Reranking
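Initial retrieval (BM25 + vector) is fast but coarse. A cross-encoder reranker takes the top-K results and reorders them by computing a relevance score using full cross-attention between query and document. This is slower than first-stage retrieval because every query-document pair needs its own forward pass, but it is usually much more accurate, which is why it is applied only to a short candidate list.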
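from sentence_transformers import CrossEncoder
# Cross-encoder trained on MS MARCO passage ranking; it scores (query, passage) pairs
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Rerank top-20 results to get top-5
# (query and initial_results are assumed to come from the first-stage retrieval)
candidates = initial_results[:20]
pairs = [(query, doc.content) for doc in candidates]
scores = reranker.predict(pairs)  # one score per pair, higher means more relevant
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]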
Query Expansion
Sometimes the user query is ambiguous or too short. Query expansion generates multiple variations to improve recall: "python performance" might expand to "python performance optimization", "python speed improvement", "python profiling". Each variation is run through retrieval, and the results are merged and deduplicated before reranking.
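The expand-then-merge flow can be sketched as follows. This is a toy illustration: `expand` stands in for whatever generates the variations (in practice often an LLM call), and `toy_search` is a hypothetical all-terms keyword matcher over an in-memory corpus, not a real retriever.

```python
# Hypothetical in-memory corpus used only for this illustration
DOCS = {
    "d1": "python performance optimization tips",
    "d2": "profiling python code with cProfile",
    "d3": "python speed improvement using caching",
    "d4": "javascript bundle size",
}


def toy_search(query):
    """Return IDs of documents that contain every query term (strict match)."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in DOCS.items()
            if terms <= set(text.lower().split())]


def expand(query):
    """Stand-in for an LLM-based expander; returns hard-coded variations."""
    variations = {
        "python performance": [
            "python performance optimization",
            "python speed improvement",
            "python profiling",
        ],
    }
    return variations.get(query, [])


def expand_and_retrieve(query, expand_fn, search_fn, top_k=5):
    """Search the original query plus its variations, merging without duplicates."""
    seen = set()
    merged = []
    for variant in [query, *expand_fn(query)]:
        for doc_id in search_fn(variant)[:top_k]:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


baseline = toy_search("python performance")
expanded = expand_and_retrieve("python performance", expand, toy_search)
print(baseline, expanded)
```

In this toy corpus the original query matches only one document, while the expanded set also reaches the profiling and speed documents. Expansion trades precision for recall, so it pairs naturally with a reranking step afterwards.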
Graph RAG
Traditional RAG retrieves independent chunks. Graph RAG builds a knowledge graph of relationships between entities and concepts, enabling multi-hop reasoning: "What are the dependencies of Service A?" can follow relationship edges across the graph.
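A minimal sketch of the multi-hop lookup, assuming the knowledge graph is already available as (source, relation, target) triples. In a real Graph RAG pipeline these triples would be extracted from documents. Here they are hand-written, and the entity names, the `depends_on` relation, and the `multi_hop` helper are all hypothetical.

```python
from collections import deque

# Hand-written triples standing in for an extracted knowledge graph
EDGES = [
    ("Service A", "depends_on", "Service B"),
    ("Service A", "depends_on", "Postgres"),
    ("Service B", "depends_on", "Redis"),
    ("Service B", "owned_by", "Team Payments"),
    ("Service C", "depends_on", "Service A"),
]


def multi_hop(edges, start, relation, max_hops=3):
    """Breadth-first walk along one relation type.

    Returns a dict mapping each reachable entity to its hop distance from start.
    """
    adjacency = {}
    for src, rel, dst in edges:
        if rel == relation:
            adjacency.setdefault(src, []).append(dst)

    found = {}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, []):
            if neighbor != start and neighbor not in found:
                found[neighbor] = hops + 1
                queue.append((neighbor, hops + 1))
    return found


deps = multi_hop(EDGES, "Service A", "depends_on")
print(deps)
```

Redis is reached only through Service B, which is the kind of two-hop answer that retrieving independent chunks tends to miss. The entities found this way would typically be turned back into text (or used to fetch their source chunks) before being passed to the model.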