Module 10: RAG Evaluation & Quality Engineering
Hallucination detection, retrieval metrics, groundedness scoring, and evaluation frameworks
3.5 hours. 2 hands-on labs. Free course module.
Learning Objectives
- Measure retrieval quality with precision and recall
- Detect and score hallucinations in generated answers
- Build automated evaluation pipelines
- Design continuous quality monitoring for production RAG
Why This Matters
Without evaluation, you do not know if your RAG system is improving or degrading. Every change — new model, new chunking, new retrieval — needs measurement. This module gives you the framework to quantify and continuously monitor RAG quality.
Lesson Content
How do you know your RAG system is working well? "It seems good" is not engineering. This module teaches you to measure, evaluate, and continuously monitor RAG quality with metrics and automated pipelines.
Retrieval Metrics
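- Precision@K: Of the top K retrieved chunks, what fraction is relevant?
- Recall@K: Of all relevant chunks in the corpus, what fraction appears in the top K?
- MRR (Mean Reciprocal Rank): Averaged over queries, how high does the first relevant result rank? Each query scores 1/rank of its first relevant result.
- NDCG (Normalized Discounted Cumulative Gain): Relevance weighted by position, normalized against the ideal ranking so scores fall between 0 and 1.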
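The four metrics above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names and the chunk IDs are made up for the example, chunk IDs are assumed to be unique within a result list, and precision is divided by K even when fewer than K chunks come back.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled with relevant chunk IDs."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the top k divided by the DCG of the ideal ordering.

    `gains` maps chunk ID to a graded relevance score; missing IDs count as 0.
    """
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Illustrative data: two of the three relevant chunks are retrieved.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(reciprocal_rank(retrieved, relevant))    # 0.5
```

Binary relevance labels are enough for the first three metrics. NDCG only adds information when the test set carries graded relevance scores.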
Groundedness Scoring
Is the generated answer actually supported by the retrieved context? A groundedness score measures what percentage of claims in the answer can be traced to the provided documents. Claims not in the context are potential hallucinations.
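As a rough sketch, the score is the share of claims labeled as supported. The verdict format below (a list of dicts with `claim` and `verdict` keys) is an assumption made for this example; the module does not fix a schema for claim verdicts.

```python
from typing import Dict, List


def is_supported(verdict: Dict[str, str]) -> bool:
    """Exact match on the label, because "NOT SUPPORTED" contains "SUPPORTED"."""
    return verdict["verdict"].strip().upper() == "SUPPORTED"


def groundedness_score(verdicts: List[Dict[str, str]]) -> float:
    """Fraction of claims that the context supports, between 0.0 and 1.0."""
    if not verdicts:
        # Convention chosen for this sketch: no claims means nothing unsupported.
        return 1.0
    return sum(1 for v in verdicts if is_supported(v)) / len(verdicts)


def unsupported_claims(verdicts: List[Dict[str, str]]) -> List[str]:
    """Claims to review as potential hallucinations."""
    return [v["claim"] for v in verdicts if not is_supported(v)]


# Illustrative verdicts, not real model output.
verdicts = [
    {"claim": "The API limit is 100 requests per minute.", "verdict": "SUPPORTED"},
    {"claim": "Limits reset at midnight UTC.", "verdict": "NOT SUPPORTED"},
    {"claim": "Exceeding the limit returns HTTP 429.", "verdict": "SUPPORTED"},
]
print(round(groundedness_score(verdicts), 2))  # 0.67
print(unsupported_claims(verdicts))
```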
Hallucination Detection
# LLM-as-judge for hallucination detection
import json

import anthropic

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def detect_hallucinations(context: str, answer: str) -> list:
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,  # required by the Messages API
        system="""You are a hallucination detector. Given context and an answer,
        identify every claim in the answer. For each claim, determine if it is
        SUPPORTED by the context or NOT SUPPORTED. Return a JSON list.""",
        messages=[{"role": "user", "content": f"Context: {context}\n\nAnswer: {answer}"}],
    )
    # Assumes the reply is bare JSON with no surrounding prose or code fence
    return json.loads(response.content[0].text)
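Judge models do not always reply with bare JSON; the list may arrive wrapped in a code fence or surrounded by a sentence of prose. One possible way to make parsing more tolerant is sketched below. The helper name `extract_json_list` is hypothetical, and the approach (take everything from the first `[` to the last `]`) is a simple heuristic that will fail if the surrounding prose itself contains square brackets.

```python
import json
import re
from typing import Any, List


def extract_json_list(text: str) -> List[Any]:
    """Parse the JSON array embedded in a model reply.

    Raises ValueError if no array is found or the array is not valid JSON
    (json.JSONDecodeError is a subclass of ValueError).
    """
    # Greedy match from the first "[" to the last "]", across newlines.
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON array found in model output")
    return json.loads(match.group(0))


# Illustrative reply with prose and a code fence around the JSON.
fence = "`" * 3
reply = (
    "Here are the verdicts:\n"
    + fence + "json\n"
    + '[{"claim": "Limits reset at midnight UTC.", "verdict": "NOT SUPPORTED"}]\n'
    + fence
)
print(extract_json_list(reply)[0]["verdict"])  # NOT SUPPORTED
```

Failing loudly on unparseable output is a deliberate choice here: a judge reply that cannot be parsed should count as an evaluation error, not as a clean answer.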
Continuous Evaluation
Run evaluation on every change: new embedding model, new chunking strategy, new retrieval method. Automate in CI/CD. Track metrics over time. Set quality thresholds that block deployment if retrieval quality degrades.
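A deployment gate can be as small as a function that compares the latest metrics against minimum thresholds and returns a nonzero exit code on failure. The sketch below assumes the evaluation step produces a dict of metric names to values; the metric names and threshold numbers are placeholders, not recommended values.

```python
from typing import Dict, List

# Hypothetical minimums; tune these against your own baseline.
THRESHOLDS: Dict[str, float] = {
    "precision@5": 0.60,
    "recall@5": 0.70,
    "groundedness": 0.90,
}


def check_quality_gate(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from evaluation report")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} is below threshold {minimum:.3f}")
    return failures


def gate_exit_code(metrics: Dict[str, float]) -> int:
    """Print failures and return 1 if any exist, else 0 (process exit code style)."""
    failures = check_quality_gate(metrics, THRESHOLDS)
    for message in failures:
        print("FAIL", message)
    return 1 if failures else 0


# Illustrative metrics from an evaluation run; recall has regressed.
latest = {"precision@5": 0.64, "recall@5": 0.58, "groundedness": 0.93}
print(gate_exit_code(latest))  # prints the recall failure, then 1
```

In a CI job, passing the result to `sys.exit` makes the pipeline step fail whenever any metric drops below its threshold. Treating a missing metric as a failure keeps a broken evaluation step from silently passing the gate.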
Common Mistakes
- Only evaluating end-to-end (cannot tell if retrieval or generation is the problem)
- No test dataset (no way to measure improvement objectively)
- Not automating evaluation (manual review does not scale)
- Ignoring hallucination detection (users lose trust after one wrong answer)
Key Terms
- Precision@K: Fraction of retrieved chunks that are relevant
- Recall@K: Fraction of relevant chunks that were retrieved
- Groundedness: Degree to which the generated answer is supported by the retrieved context
- LLM-as-Judge: Using an LLM to evaluate another LLM's output
Hands-On Labs
Evaluate Retrieval Quality
Measure precision, recall, and MRR on a test dataset.
30 min - Intermediate
- Create a test dataset with queries and expected relevant documents
- Run retrieval and compute precision@5, recall@5, MRR
- Compare metrics across different retrieval strategies
- Identify queries where retrieval fails
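The lab steps above might look roughly like the following. Everything named here is a placeholder: `TEST_SET`, the chunk IDs, and `fake_retrieve` stand in for your own dataset and retriever, which is assumed to be a function that takes a query and K and returns a ranked list of chunk IDs. Reciprocal rank is computed within the top K only, and the test set is assumed to be non-empty.

```python
from typing import Callable, Dict, List

# Hypothetical test set: each case pairs a query with the chunk IDs that should be retrieved.
TEST_SET = [
    {"query": "How do I reset my password?", "relevant": {"kb-12", "kb-40"}},
    {"query": "What is the refund window?", "relevant": {"kb-07"}},
]


def evaluate_retrieval(
    retrieve: Callable[[str, int], List[str]], test_set: List[dict], k: int = 5
) -> Dict[str, object]:
    """Run every test query and average precision@k, recall@k, and reciprocal rank."""
    rows = []
    for case in test_set:
        retrieved = retrieve(case["query"], k)[:k]
        relevant = case["relevant"]
        hits = {doc_id for doc_id in retrieved if doc_id in relevant}
        first_hit = next(
            (rank for rank, doc_id in enumerate(retrieved, start=1) if doc_id in relevant),
            None,
        )
        rows.append({
            "query": case["query"],
            "precision": len(hits) / k,
            "recall": len(hits) / len(relevant),
            "rr": 1.0 / first_hit if first_hit else 0.0,
        })
    n = len(rows)
    return {
        f"precision@{k}": sum(r["precision"] for r in rows) / n,
        f"recall@{k}": sum(r["recall"] for r in rows) / n,
        "mrr": sum(r["rr"] for r in rows) / n,
        # Queries where nothing relevant came back are the first ones to inspect.
        "failed_queries": [r["query"] for r in rows if r["recall"] == 0.0],
    }


def fake_retrieve(query: str, k: int) -> List[str]:
    """Stand-in retriever with canned results, used only to make the sketch runnable."""
    canned = {
        "How do I reset my password?": ["kb-12", "kb-03", "kb-40", "kb-99", "kb-05"],
        "What is the refund window?": ["kb-01", "kb-02", "kb-03", "kb-04", "kb-05"],
    }
    return canned[query][:k]


report = evaluate_retrieval(fake_retrieve, TEST_SET, k=5)
print(report["failed_queries"])  # ['What is the refund window?']
```

To compare retrieval strategies, call `evaluate_retrieval` once per strategy with the same test set and put the resulting reports side by side.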
Build a Hallucination Detection Pipeline
Detect and score hallucinations automatically.
35 min - Advanced
- Implement LLM-as-judge for groundedness scoring
- Run on 50 query-answer pairs
- Build a Grafana dashboard for quality metrics
- Set up alerts for quality threshold violations
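For the alerting step, one simple option is a rolling mean over recent groundedness scores that raises a flag when it drops below a threshold. The class below is a sketch under assumptions: the name `GroundednessMonitor`, the window of 50, and the 0.9 threshold are illustrative, and forwarding the alert to Grafana or another alerting tool is left out because that depends on your setup.

```python
from collections import deque
from typing import Deque, Optional


class GroundednessMonitor:
    """Tracks a rolling mean of groundedness scores and flags threshold violations."""

    def __init__(self, window: int = 50, threshold: float = 0.9) -> None:
        self.window = window
        self.threshold = threshold
        self.scores: Deque[float] = deque(maxlen=window)  # oldest scores drop off

    def record(self, score: float) -> Optional[str]:
        """Add one score; return an alert message if the rolling mean is too low."""
        self.scores.append(score)
        if len(self.scores) < self.window:
            return None  # wait for a full window before alerting
        mean = sum(self.scores) / len(self.scores)
        if mean < self.threshold:
            return f"groundedness rolling mean {mean:.2f} is below {self.threshold:.2f}"
        return None


# Illustrative run with a tiny window so the alert is easy to see.
monitor = GroundednessMonitor(window=3, threshold=0.8)
for score in [1.0, 1.0, 1.0, 0.2]:
    alert = monitor.record(score)
    if alert:
        print("ALERT:", alert)
```

Alerting on a window instead of a single answer avoids paging someone for one bad response while still catching a sustained drop.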
Key Takeaways
- Evaluate retrieval and generation SEPARATELY: bad retrieval causes bad answers, and an end-to-end score cannot tell you which stage failed
- Precision@K and Recall@K are the core retrieval metrics; add MRR or NDCG when ranking order matters
- Groundedness scoring detects hallucinations by checking claims against context
- LLM-as-judge is practical for automated evaluation at scale
- Run evaluation on every change — treat quality as a CI/CD gate