Module 10 of 16

RAG Evaluation & Quality Engineering

Hallucination detection, retrieval metrics, groundedness scoring, and evaluation frameworks

3.5 hours · 2 labs · Free

Start here

Learning objectives

  • Measure retrieval quality with precision and recall
  • Detect and score hallucinations in generated answers
  • Build automated evaluation pipelines
  • Design continuous quality monitoring for production RAG

[Figure: RAG Evaluation Framework. Four components: Retrieval Quality (precision, recall, MRR); Groundedness (is the answer in the context?); Answer Quality (relevance, completeness); Hallucination Score (claims not in context). Evaluate retrieval and generation separately, because bad retrieval causes bad answers. Automate evaluation in CI/CD and run it on every model or retrieval change.]

How do you know your RAG system is working well? "It seems good" is not engineering. This module teaches you to measure, evaluate, and continuously monitor RAG quality with metrics and automated pipelines.

Retrieval Metrics

  • Precision@K: Of the K retrieved chunks, how many were relevant?
  • Recall@K: Of all relevant chunks, how many were retrieved?
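  • MRR: Mean Reciprocal Rank, the average over all queries of 1/rank of the first relevant result (higher means relevant results appear earlier)
  • NDCG: Normalized Discounted Cumulative Gain, relevance discounted by rank position and normalized against the ideal ordering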
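
The four metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes each retrieved chunk has a unique ID, that relevance is binary (a chunk is either relevant or not), and that Precision@K divides by K even when fewer than K chunks come back. The function names are chosen for this example.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant (divides by k)."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: two of three relevant chunks are retrieved, the first at rank 2.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
print(precision_at_k(retrieved, relevant, 5))  # 2 hits out of 5
print(recall_at_k(retrieved, relevant, 5))     # 2 hits out of 3 relevant
print(reciprocal_rank(retrieved, relevant))    # first hit at rank 2
```

If relevance labels are graded rather than binary, the NDCG gain term would use the grade instead of a constant 1.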

Groundedness Scoring

Is the generated answer actually supported by the retrieved context? A groundedness score measures what percentage of claims in the answer can be traced to the provided documents. Claims not in the context are potential hallucinations.
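
As a rough sketch of the arithmetic, suppose each claim in an answer has already been labeled as supported or not. The score is then the supported fraction. The verdict format (a dict with `claim` and `supported` keys) and the choice to score an answer with no claims as 1.0 are assumptions made for this example.

```python
def groundedness_score(verdicts):
    """Fraction of claims that are supported by the retrieved context.

    verdicts: list of dicts like {"claim": str, "supported": bool}.
    """
    if not verdicts:
        # Assumption: an answer with no checkable claims counts as grounded.
        return 1.0
    supported = sum(1 for v in verdicts if v["supported"])
    return supported / len(verdicts)

def unsupported_claims(verdicts):
    """Claims that could not be traced to the context (potential hallucinations)."""
    return [v["claim"] for v in verdicts if not v["supported"]]

verdicts = [
    {"claim": "The API limit is 100 requests per minute.", "supported": True},
    {"claim": "Limits reset at midnight UTC.", "supported": True},
    {"claim": "Enterprise plans have no limit.", "supported": False},
    {"claim": "Requests over the limit return HTTP 429.", "supported": True},
]
print(groundedness_score(verdicts))   # 3 of 4 claims supported
print(unsupported_claims(verdicts))
```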

Hallucination Detection

# LLM-as-judge for hallucination detection
import json

import anthropic

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def detect_hallucinations(context: str, answer: str) -> list:
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,  # required by the Messages API
        system="""You are a hallucination detector. Given context and an answer,
identify every claim in the answer. For each claim, determine if it is
SUPPORTED by the context or NOT SUPPORTED. Return only a JSON list.""",
        messages=[{"role": "user", "content": f"Context: {context}\n\nAnswer: {answer}"}],
    )
    # Assumes the reply is bare JSON with no surrounding prose.
    return json.loads(response.content[0].text)
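
A judge model does not always return bare JSON; it may wrap the list in a code fence or add a sentence before it. The helper below is one possible way to tolerate that. It is a sketch under assumptions: the name `parse_judge_output` is made up for this example, and it assumes the judge returns a list of objects with `claim` and `verdict` keys, which the prompt above does not strictly guarantee.

```python
import json

def parse_judge_output(text):
    """Extract the JSON list from a judge reply and normalize the verdicts.

    Returns a list of {"claim": str, "supported": bool}.
    Raises ValueError if no JSON list can be located.
    """
    # Take everything from the first "[" to the last "]" to skip any
    # surrounding prose or code fences.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON list found in judge output")
    items = json.loads(text[start:end + 1])

    verdicts = []
    for item in items:
        # Assumed schema: {"claim": "...", "verdict": "SUPPORTED" | "NOT SUPPORTED"}
        label = str(item.get("verdict", "")).strip().upper()
        verdicts.append({
            "claim": item.get("claim", ""),
            "supported": label == "SUPPORTED",
        })
    return verdicts
```

Anything other than an exact `SUPPORTED` label is treated as unsupported, which errs on the side of flagging a claim rather than silently passing it.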

Continuous Evaluation

Run evaluation on every change: new embedding model, new chunking strategy, new retrieval method. Automate in CI/CD. Track metrics over time. Set quality thresholds that block deployment if retrieval quality degrades.
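
A deployment gate can be as simple as comparing the latest metrics against minimum values and returning a non-zero exit code on failure. The sketch below illustrates the idea; the metric names and threshold values are hypothetical and would need to be calibrated against your own baseline.

```python
# Hypothetical minimum acceptable values; tune these to your own baseline.
THRESHOLDS = {
    "precision@5": 0.70,
    "recall@5": 0.60,
    "groundedness": 0.90,
}

def check_quality_gate(metrics, thresholds):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from evaluation results")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} is below minimum {minimum:.3f}")
    return failures

def gate_exit_code(metrics, thresholds=THRESHOLDS):
    """Print failures and return 0 (pass) or 1 (block deployment)."""
    failures = check_quality_gate(metrics, thresholds)
    for failure in failures:
        print("QUALITY GATE FAILED:", failure)
    return 1 if failures else 0
```

In a CI job, the evaluation script would pass the result of `gate_exit_code(...)` to `sys.exit`, so that a regression fails the pipeline step. Treating a missing metric as a failure prevents a broken evaluation run from passing silently.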

Common mistakes

What usually breaks

  • Only evaluating end-to-end (cannot tell if retrieval or generation is the problem)
  • No test dataset (no way to measure improvement objectively)
  • Not automating evaluation (manual review does not scale)
  • Ignoring hallucination detection (users lose trust after one wrong answer)

Key terms

Vocabulary used in this module

Precision@K

Fraction of the top K retrieved chunks that are relevant

Recall@K

Fraction of all relevant chunks that appear in the top K retrieved

Groundedness

Degree to which the generated answer is supported by the retrieved context

LLM-as-Judge

Using an LLM to evaluate another LLM's output

Labs

Hands-on labs

30 min · Intermediate

Evaluate Retrieval Quality

Measure precision, recall, and MRR on a test dataset.

  1. Create a test dataset with queries and expected relevant documents
  2. Run retrieval and compute precision@5, recall@5, MRR
  3. Compare metrics across different retrieval strategies
  4. Identify queries where retrieval fails
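
The lab steps above could be wired together roughly as follows. This is a sketch, not the lab solution: the dataset format, the `evaluate_strategy` name, and the two toy retrievers are assumptions for illustration. A real retriever would be any callable that takes a query string and returns a ranked list of chunk IDs. Note that the reciprocal rank here only looks at the top K results.

```python
def evaluate_strategy(retrieve, dataset, k=5):
    """Run one retrieval strategy over a test set.

    Returns (summary, failures): mean metrics and the queries for which
    no relevant chunk appeared in the top k.
    """
    per_query = []
    for example in dataset:
        retrieved = retrieve(example["query"])[:k]
        relevant = set(example["relevant_ids"])
        hits = [doc_id for doc_id in retrieved if doc_id in relevant]
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        per_query.append({
            "query": example["query"],
            "precision": len(hits) / k,
            "recall": len(hits) / len(relevant) if relevant else 0.0,
            "rr": rr,
        })
    n = len(per_query)
    summary = {
        metric: sum(q[metric] for q in per_query) / n if n else 0.0
        for metric in ("precision", "recall", "rr")
    }
    failures = [q["query"] for q in per_query if q["rr"] == 0.0]
    return summary, failures

# Toy test dataset and two stand-in retrieval strategies.
dataset = [
    {"query": "reset password", "relevant_ids": ["a1", "a2"]},
    {"query": "refund policy", "relevant_ids": ["b1"]},
]
index_a = {
    "reset password": ["a1", "x", "a2", "y", "z"],
    "refund policy": ["x", "y", "z", "w", "v"],
}
index_b = {
    "reset password": ["x", "a1", "y", "z", "w"],
    "refund policy": ["b1", "x", "y", "z", "w"],
}
strategies = {
    "strategy_a": lambda q: index_a.get(q, []),
    "strategy_b": lambda q: index_b.get(q, []),
}
for name, retrieve in strategies.items():
    summary, failures = evaluate_strategy(retrieve, dataset, k=5)
    print(name, summary, "failed queries:", failures)
```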
35 min · Advanced

Build a Hallucination Detection Pipeline

Detect and score hallucinations automatically.

  1. Implement LLM-as-judge for groundedness scoring
  2. Run on 50 query-answer pairs
  3. Build a Grafana dashboard for quality metrics
  4. Set up alerts for quality threshold violations
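
For the alerting step, one simple approach is a rolling mean over recent groundedness scores that fires when it drops below a threshold. The class below is a hypothetical sketch of that logic only; it does not talk to Grafana or any alerting service, and the window size and threshold are placeholder values.

```python
from collections import deque

class QualityMonitor:
    """Tracks a rolling mean of scores and flags threshold violations."""

    def __init__(self, threshold, window=50):
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # oldest scores drop off automatically

    def rolling_mean(self):
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def record(self, score):
        """Add a score; return True if an alert should fire.

        Alerts only once the window is full, to avoid noisy alerts
        from the first few samples.
        """
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.rolling_mean() < self.threshold

monitor = QualityMonitor(threshold=0.8, window=3)
for score in [1.0, 0.9, 0.9, 0.5]:
    if monitor.record(score):
        print(f"ALERT: rolling groundedness {monitor.rolling_mean():.2f} below 0.80")
```

In practice the same rolling value would be exported as a metric so the dashboard and the alert rule read from one source.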

Recap

Key takeaways

  • Evaluate retrieval and generation SEPARATELY — bad retrieval causes bad answers
  • Precision@K and Recall@K are the baseline retrieval metrics; add MRR or NDCG when ranking order matters
  • Groundedness scoring detects hallucinations by checking claims against context
  • LLM-as-judge is practical for automated evaluation at scale
  • Run evaluation on every change — treat quality as a CI/CD gate
