
Module 10: RAG Evaluation & Quality Engineering

Hallucination detection, retrieval metrics, groundedness scoring, and evaluation frameworks

3.5 hours. 2 hands-on labs. Free course module.

Learning Objectives

  • Measure retrieval quality with precision and recall
  • Detect and score hallucinations in generated answers
  • Build automated evaluation pipelines
  • Design continuous quality monitoring for production RAG

Why This Matters

Without evaluation, you do not know if your RAG system is improving or degrading. Every change — new model, new chunking, new retrieval — needs measurement. This module gives you the framework to quantify and continuously monitor RAG quality.

RAG Evaluation Framework

  • Retrieval Quality: precision, recall, MRR
  • Groundedness: is the answer in the context?
  • Answer Quality: relevance, completeness
  • Hallucination Score: claims not in context

Evaluate retrieval AND generation separately. Bad retrieval causes bad answers. Automate evaluation in CI/CD. Run on every model/retrieval change.

Architecture diagram for Module 10: RAG Evaluation & Quality Engineering.

Lesson Content

How do you know your RAG system is working well? "It seems good" is not engineering. This module teaches you to measure, evaluate, and continuously monitor RAG quality with metrics and automated pipelines.

Retrieval Metrics

  • Precision@K: Of the K retrieved chunks, how many were relevant?
  • Recall@K: Of all relevant chunks, how many were retrieved?
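  • MRR: Mean Reciprocal Rank, the average over queries of 1/rank of the first relevant result (1.0 means it is always ranked first)
  • NDCG: Normalized Discounted Cumulative Gain, which rewards relevant results more the higher they are ranked, normalized against the ideal ordering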
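
The four metrics above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: it assumes binary relevance (a chunk is either relevant or not), chunk IDs as strings, and retrieved lists without duplicates. All function names are hypothetical.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled with relevant chunks (divides by k)."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Illustrative (made-up) query result: 2 of the 3 relevant chunks are retrieved
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
p5 = precision_at_k(retrieved, relevant, 5)   # 2 / 5 = 0.4
r5 = recall_at_k(retrieved, relevant, 5)      # 2 / 3
rr = reciprocal_rank(retrieved, relevant)     # first hit at rank 2 -> 0.5
```

Dividing precision by k (rather than by the number of chunks actually returned) penalizes a retriever that returns fewer than k results; whether that is the right choice depends on how the retriever is used.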

Groundedness Scoring

Is the generated answer actually supported by the retrieved context? A groundedness score measures what percentage of claims in the answer can be traced to the provided documents. Claims not in the context are potential hallucinations.
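
As a rough sketch, the score can be computed once each claim has been labeled. The verdict format below (a list of dicts with `claim` and `verdict` keys, where `verdict` is `SUPPORTED` or `NOT SUPPORTED`) is an assumption for illustration, as is the function name and the convention of scoring an answer with no claims as 1.0.

```python
def groundedness_score(verdicts):
    """Return (score, unsupported_claims) for a list of labeled claims.

    Assumed (hypothetical) verdict format:
        {"claim": "...", "verdict": "SUPPORTED" | "NOT SUPPORTED"}
    """
    if not verdicts:
        # Convention chosen here: no claims means nothing to hallucinate
        return 1.0, []
    unsupported = [
        v["claim"]
        for v in verdicts
        if v["verdict"].strip().upper() != "SUPPORTED"
    ]
    score = (len(verdicts) - len(unsupported)) / len(verdicts)
    return score, unsupported

# Illustrative (made-up) verdicts for one answer
example_verdicts = [
    {"claim": "The API supports pagination.", "verdict": "SUPPORTED"},
    {"claim": "Page size defaults to 50.", "verdict": "SUPPORTED"},
    {"claim": "The limit is 10,000 rows.", "verdict": "NOT SUPPORTED"},
    {"claim": "Cursors expire after one hour.", "verdict": "SUPPORTED"},
]
score, unsupported = groundedness_score(example_verdicts)
# score is 3 / 4 = 0.75; unsupported holds the one claim to review
```

The unsupported claims are returned alongside the score so they can be logged or reviewed, since a single number hides which claim failed.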

Hallucination Detection

# LLM-as-judge for hallucination detection
import json

import anthropic

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def detect_hallucinations(context: str, answer: str) -> list:
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,  # required by the Messages API
        system="""You are a hallucination detector. Given context and an answer,
identify every claim in the answer. For each claim, determine if it is
SUPPORTED by the context or NOT SUPPORTED. Return only a JSON list.""",
        messages=[{"role": "user", "content": f"Context: {context}\n\nAnswer: {answer}"}],
    )
    # json.loads raises an error if the judge returns anything other than valid JSON
    return json.loads(response.content[0].text)

Continuous Evaluation

Run evaluation on every change: new embedding model, new chunking strategy, new retrieval method. Automate in CI/CD. Track metrics over time. Set quality thresholds that block deployment if retrieval quality degrades.
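
A deployment gate of this kind can be sketched as a small check that compares the latest metrics against minimum thresholds. The metric names, threshold values, and function names below are hypothetical placeholders, not recommended targets; real thresholds would come from your own baseline runs.

```python
# Hypothetical minimum acceptable values; tune these against your own baseline
THRESHOLDS = {
    "precision_at_5": 0.60,
    "recall_at_5": 0.70,
    "groundedness": 0.90,
}

def check_quality_gate(metrics, thresholds=THRESHOLDS):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A missing metric is treated as a failure, not silently skipped
            failures.append(f"{name}: missing from metrics")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} is below the minimum {minimum:.3f}")
    return failures

def run_gate(metrics):
    """Print any failures and return a process exit code (0 = pass, 1 = fail)."""
    failures = check_quality_gate(metrics)
    for failure in failures:
        print(f"QUALITY GATE FAILED: {failure}")
    return 1 if failures else 0
```

In a CI job, passing the return value of `run_gate` to `sys.exit` would make a non-zero exit code fail the pipeline step, which is what blocks the deployment.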

Common Mistakes

  • Only evaluating end-to-end (cannot tell if retrieval or generation is the problem)
  • No test dataset (no way to measure improvement objectively)
  • Not automating evaluation (manual review does not scale)
  • Ignoring hallucination detection (users can lose trust after a single wrong answer)
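
The first mistake in the list is easier to avoid when each query carries both a retrieval metric and a generation metric. The sketch below attributes a bad result to a stage; the function name, the choice of recall and groundedness as the two signals, the thresholds, and the sample data are all assumptions for illustration.

```python
def diagnose_failure(recall_at_k, groundedness, min_recall=0.5, min_groundedness=0.8):
    """Attribute a poor result to a pipeline stage (thresholds are hypothetical)."""
    if recall_at_k < min_recall:
        # The relevant chunks never reached the model, so fix retrieval first
        return "retrieval"
    if groundedness < min_groundedness:
        # Context was adequate but the answer strayed from it
        return "generation"
    return "ok"

# Illustrative (made-up) per-query results
results = [
    {"query": "q1", "recall_at_5": 0.2, "groundedness": 0.9},
    {"query": "q2", "recall_at_5": 1.0, "groundedness": 0.5},
    {"query": "q3", "recall_at_5": 1.0, "groundedness": 1.0},
]
diagnoses = {
    r["query"]: diagnose_failure(r["recall_at_5"], r["groundedness"])
    for r in results
}
# diagnoses == {"q1": "retrieval", "q2": "generation", "q3": "ok"}
```

Retrieval is checked first because a generation model cannot be expected to answer well from context that does not contain the answer.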

Key Terms

  • Precision@K: Fraction of the top K retrieved chunks that are relevant
  • Recall@K: Fraction of all relevant chunks that appear in the top K retrieved
  • Groundedness: Degree to which the generated answer is supported by the retrieved context
  • LLM-as-Judge: Using an LLM to evaluate another LLM's output

Hands-On Labs

  1. Evaluate Retrieval Quality

    Measure precision, recall, and MRR on a test dataset.

    30 min - Intermediate

    • Create a test dataset with queries and expected relevant documents
    • Run retrieval and compute precision@5, recall@5, MRR
    • Compare metrics across different retrieval strategies
    • Identify queries where retrieval fails

    View lab files on GitHub

  2. Build a Hallucination Detection Pipeline

    Detect and score hallucinations automatically.

    35 min - Advanced

    • Implement LLM-as-judge for groundedness scoring
    • Run on 50 query-answer pairs
    • Build a Grafana dashboard for quality metrics
    • Set up alerts for quality threshold violations

    View lab files on GitHub

Key Takeaways

  • Evaluate retrieval and generation SEPARATELY — bad retrieval causes bad answers
  • Precision@K and Recall@K are the most important retrieval metrics
  • Groundedness scoring detects hallucinations by checking claims against context
  • LLM-as-judge is practical for automated evaluation at scale
  • Run evaluation on every change — treat quality as a CI/CD gate