How do you know your RAG system is working well? "It seems good" is not engineering. This module teaches you to measure, evaluate, and continuously monitor RAG quality with metrics and automated pipelines.
Retrieval Metrics
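- Precision@K: Of the K retrieved chunks, what fraction are relevant?
- Recall@K: Of all relevant chunks, what fraction appear in the top K results?
- MRR: Mean Reciprocal Rank. The average over queries of 1/rank of the first relevant result; a higher value means the first relevant chunk appears earlier.
- NDCG: Normalized Discounted Cumulative Gain. Relevance weighted by position, normalized against the ideal ordering so that a perfect ranking scores 1.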
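The four metrics above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes binary relevance (a chunk is either relevant or not), that each query has a ranked list of chunk IDs with no duplicates, and a hand-labeled set of relevant IDs. The function names and the sample IDs are made up for the example.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunk IDs that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Illustrative data: ranked retrieval output and hand-labeled relevant chunks.
retrieved = ["c3", "c7", "c1", "c9"]
relevant = {"c1", "c7", "c5"}

p3 = precision_at_k(retrieved, relevant, 3)   # 2 of the top 3 are relevant
r3 = recall_at_k(retrieved, relevant, 3)      # 2 of the 3 relevant chunks found
rr = reciprocal_rank(retrieved, relevant)     # first relevant chunk is at rank 2
n3 = ndcg_at_k(retrieved, relevant, 3)
```

With graded relevance labels (for example 0 to 3) the NDCG gain term would use the label instead of a flat 1.0; the binary version is used here only to keep the sketch short.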
Groundedness Scoring
Is the generated answer actually supported by the retrieved context? A groundedness score measures what percentage of claims in the answer can be traced to the provided documents. Claims not in the context are potential hallucinations.
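As a rough sketch, the score itself is simple arithmetic once each claim has a verdict. The example below assumes the verdicts arrive as a list of dictionaries with hypothetical `claim` and `verdict` keys, where `verdict` is either `SUPPORTED` or `NOT SUPPORTED`; how the verdicts are produced (human review, an LLM judge, an entailment model) is outside this snippet.

```python
def groundedness_score(verdicts):
    """Share of claims labeled SUPPORTED, or None when there are no claims.

    `verdicts` is assumed to be a list of dicts with "claim" and "verdict"
    keys; this shape is an illustration, not a fixed schema.
    """
    if not verdicts:
        return None  # no claims to check, so the score is undefined
    supported = sum(1 for v in verdicts if v["verdict"] == "SUPPORTED")
    return supported / len(verdicts)

def unsupported_claims(verdicts):
    """Claims that could not be traced to the context (potential hallucinations)."""
    return [v["claim"] for v in verdicts if v["verdict"] != "SUPPORTED"]

# Made-up verdicts for a three-claim answer.
example_verdicts = [
    {"claim": "The warranty lasts two years.", "verdict": "SUPPORTED"},
    {"claim": "Returns are free within 30 days.", "verdict": "SUPPORTED"},
    {"claim": "The product ships from Canada.", "verdict": "NOT SUPPORTED"},
]

score = groundedness_score(example_verdicts)      # 2 of 3 claims supported
flagged = unsupported_claims(example_verdicts)    # the one unsupported claim
```

Returning `None` for an answer with no claims, rather than 0 or 1, keeps empty answers from skewing an average in either direction.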
Hallucination Detection
# LLM-as-judge for hallucination detection
import json
import anthropic

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def detect_hallucinations(context: str, answer: str) -> list:
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,  # required by the Messages API
        system="""You are a hallucination detector. Given context and an answer,
        identify every claim in the answer. For each claim, determine if it is
        SUPPORTED by the context or NOT SUPPORTED. Return only a JSON list.""",
        messages=[{"role": "user", "content": f"Context: {context}\n\nAnswer: {answer}"}],
    )
    # Assumes the reply is bare JSON; json.loads raises ValueError otherwise
    return json.loads(response.content[0].text)
Continuous Evaluation
Run the evaluation on every change to the pipeline: a new embedding model, a new chunking strategy, a new retrieval method. Automate the run in CI/CD and track the metrics over time. Set quality thresholds that block deployment if retrieval quality degrades.
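A deployment gate of this kind can be sketched as a small check that compares the latest metrics against minimum values. Everything specific here is an assumption for illustration: the metric names, the threshold numbers, and the idea that an earlier evaluation step has already produced a dictionary of scores.

```python
# Hypothetical minimums; real values should come from your own baseline runs.
THRESHOLDS = {
    "precision@5": 0.70,
    "recall@5": 0.80,
    "groundedness": 0.90,
}

def check_thresholds(metrics, thresholds):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from evaluation output")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} is below the minimum {minimum:.3f}")
    return failures

def run_gate(metrics, thresholds=THRESHOLDS):
    """Print each failure and return a process exit code (0 = pass, 1 = block)."""
    failures = check_thresholds(metrics, thresholds)
    for failure in failures:
        print(f"FAIL {failure}")
    return 1 if failures else 0

# Made-up scores from an evaluation run after a chunking change.
candidate_metrics = {"precision@5": 0.74, "recall@5": 0.77, "groundedness": 0.93}
exit_code = run_gate(candidate_metrics)  # recall@5 is under its minimum
```

In a CI job, passing the result to `sys.exit(run_gate(metrics))` makes the step fail with a non-zero status, which most CI systems treat as a blocked build. Treating a missing metric as a failure is a deliberate choice here, so a broken evaluation step cannot pass the gate silently.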