Module 10 of 16

RAG Evaluation & Quality Engineering

Hallucination detection, retrieval metrics, groundedness scoring, and evaluation frameworks

3.5 hours · 2 labs · Free


Start here

Learning objectives

Measure retrieval quality with precision and recall
Detect and score hallucinations in generated answers
Build automated evaluation pipelines
Design continuous quality monitoring for production RAG

How do you know your RAG system is working well? "It seems good" is not engineering. This module teaches you to measure, evaluate, and continuously monitor RAG quality with metrics and automated pipelines.

Retrieval Metrics

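Precision@K: Of the top K retrieved chunks, what fraction are relevant?
Recall@K: Of all relevant chunks for the query, what fraction appear in the top K?
MRR: Mean Reciprocal Rank - the average over queries of 1/rank of the first relevant result; higher means relevant results appear earlier
NDCG: Normalized Discounted Cumulative Gain - graded relevance discounted by rank position, normalized against the ideal ordering so scores fall between 0 and 1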
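
The four metrics above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes each chunk has a unique string ID, that ground-truth relevance labels exist per query, and that NDCG uses the common log2 discount. The function names and sample IDs are hypothetical.

```python
import math


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the top k divided by the DCG of the ideal ordering.

    `gains` maps chunk ID to a relevance grade; unlisted chunks count as 0.
    """
    dcg = sum(
        gains.get(chunk_id, 0.0) / math.log2(i + 2)
        for i, chunk_id in enumerate(retrieved[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranked output and labels for a single query
retrieved = ["c3", "c1", "c7", "c2", "c9"]
relevant = {"c1", "c2", "c4"}

print(precision_at_k(retrieved, relevant, 5))  # 0.4 (2 of 5 are relevant)
print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant chunks found
print(reciprocal_rank(retrieved, relevant))    # 0.5 (first hit at rank 2)
print(ndcg_at_k(retrieved, {c: 1.0 for c in relevant}, 5))  # roughly 0.498
```

Precision divides by K even when fewer than K chunks come back, which penalizes a retriever that returns short lists. That is a convention choice, and some teams divide by the number actually returned instead.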

Groundedness Scoring

Is the generated answer actually supported by the retrieved context? A groundedness score measures what percentage of claims in the answer can be traced to the provided documents. Claims not in the context are potential hallucinations.
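
As a minimal sketch of that score, assume each claim in the answer has already been labelled as supported or not. The `claim` and `verdict` keys are an assumed schema for illustration, and returning 1.0 for an answer with no claims is a design choice rather than a standard.

```python
def groundedness_score(verdicts: list[dict]) -> float:
    """Share of claims labelled SUPPORTED by the retrieved context."""
    if not verdicts:
        # Assumption: an answer with no factual claims has nothing unsupported
        return 1.0
    supported = sum(
        1 for v in verdicts if v["verdict"].strip().upper() == "SUPPORTED"
    )
    return supported / len(verdicts)


def unsupported_claims(verdicts: list[dict]) -> list[str]:
    """Claims that could not be traced to the context (potential hallucinations)."""
    return [
        v["claim"] for v in verdicts if v["verdict"].strip().upper() != "SUPPORTED"
    ]


# Hypothetical labelled claims for one generated answer
verdicts = [
    {"claim": "The API rate limit is 100 requests per minute", "verdict": "SUPPORTED"},
    {"claim": "Limits reset every 60 seconds", "verdict": "SUPPORTED"},
    {"claim": "Enterprise plans have no rate limit", "verdict": "NOT SUPPORTED"},
    {"claim": "Exceeding the limit returns HTTP 429", "verdict": "SUPPORTED"},
]

print(groundedness_score(verdicts))  # 0.75 (3 of 4 claims supported)
print(unsupported_claims(verdicts))  # ['Enterprise plans have no rate limit']
```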

Hallucination Detection

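# LLM-as-judge for hallucination detection
import json

import anthropic

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def detect_hallucinations(context: str, answer: str) -> list[dict]:
    """Ask a judge model to label each claim in the answer as supported or not."""
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,  # required by the Messages API
        system="""You are a hallucination detector. Given context and an answer,
identify every claim in the answer. For each claim, determine if it is
SUPPORTED by the context or NOT SUPPORTED. Return only a JSON list.""",
        messages=[{"role": "user", "content": f"Context: {context}\n\nAnswer: {answer}"}],
    )
    # Assumes the reply is bare JSON with no surrounding prose or code fence
    return json.loads(response.content[0].text)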
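
The judge's reply has to be parsed before it can be scored, and models sometimes wrap JSON in a code fence or add a sentence around it. The sketch below shows one tolerant way to pull out the list. `extract_json_list` is a hypothetical helper, not part of any SDK, and it assumes the reply contains exactly one top-level JSON list.

```python
import json


def extract_json_list(text: str) -> list:
    """Parse a JSON list from model output, tolerating fences and extra prose."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the span between the first '[' and the last ']'
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            raise ValueError("no JSON list found in judge output")
        parsed = json.loads(text[start : end + 1])
    if not isinstance(parsed, list):
        raise ValueError("judge output is not a JSON list")
    return parsed


# Hypothetical judge reply wrapped in a code fence
fence = "`" * 3
reply = (
    f"{fence}json\n"
    '[{"claim": "Limits reset hourly", "verdict": "NOT SUPPORTED"}]\n'
    f"{fence}"
)
print(extract_json_list(reply))
```

Failing loudly with `ValueError` when no list is found keeps a malformed judge reply from being silently counted as "no hallucinations".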

Continuous Evaluation

Run evaluation on every change: new embedding model, new chunking strategy, new retrieval method. Automate in CI/CD. Track metrics over time. Set quality thresholds that block deployment if retrieval quality degrades.
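
A deployment gate of this kind can be sketched as a small check that a CI job runs after the evaluation step. The metric names, threshold values, and allowed regression margin below are hypothetical placeholders; real values depend on your dataset and baseline.

```python
# Hypothetical minimum acceptable values; tune these for your own test set
THRESHOLDS = {
    "precision@5": 0.60,
    "recall@5": 0.70,
    "mrr": 0.50,
    "groundedness": 0.90,
}


def check_thresholds(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from evaluation results")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} is below minimum {minimum:.3f}")
    return failures


def check_regression(
    current: dict[str, float], baseline: dict[str, float], max_drop: float = 0.02
) -> list[str]:
    """Return one message per metric that dropped more than max_drop vs. baseline."""
    failures = []
    for name, previous in baseline.items():
        value = current.get(name, 0.0)
        if value < previous - max_drop:
            failures.append(f"{name}: dropped from {previous:.3f} to {value:.3f}")
    return failures


def gate_exit_code(current: dict[str, float], baseline: dict[str, float]) -> int:
    """0 if the build may deploy, 1 if any check failed."""
    failures = check_thresholds(current, THRESHOLDS) + check_regression(current, baseline)
    for message in failures:
        print("QUALITY GATE FAILED:", message)
    return 1 if failures else 0


# Hypothetical results from the current run and the last deployed version
current = {"precision@5": 0.65, "recall@5": 0.68, "mrr": 0.55, "groundedness": 0.93}
baseline = {"precision@5": 0.64, "recall@5": 0.75, "mrr": 0.54, "groundedness": 0.92}
exit_code = gate_exit_code(current, baseline)
```

In a pipeline, the script would end with `sys.exit(exit_code)` so that a non-zero code fails the job. Storing the baseline alongside each release also gives you the history needed to track metrics over time.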

Common mistakes

What usually breaks

Only evaluating end-to-end (cannot tell if retrieval or generation is the problem)
No test dataset (no way to measure improvement objectively)
Not automating evaluation (manual review does not scale)
Ignoring hallucination detection (users lose trust after one wrong answer)

Key terms

Vocabulary used in this module

Precision@K

Fraction of the top K retrieved chunks that are relevant

Recall@K

Fraction of all relevant chunks that appear in the top K retrieved

Groundedness

Degree to which a generated answer is supported by the retrieved context

LLM-as-Judge

Using an LLM to evaluate another LLM's output

Labs

Hands-on labs

30 min · Intermediate

Evaluate Retrieval Quality

Measure precision, recall, and MRR on a test dataset.

Create a test dataset with queries and expected relevant documents
Run retrieval and compute precision@5, recall@5, MRR
Compare metrics across different retrieval strategies
Identify queries where retrieval fails
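
The lab steps above can be sketched as a small harness. Everything in it is a stand-in: the queries, document IDs, and the two "retrievers" (canned lookups labelled keyword and dense) are hypothetical and exist only to show the shape of the comparison, not the lab's actual code.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]

# Step 1: test dataset of queries with expected relevant document IDs
TEST_SET = [
    {"query": "reset password", "relevant": {"d1", "d2"}},
    {"query": "refund policy", "relevant": {"d3"}},
    {"query": "export data", "relevant": {"d4", "d5"}},
]

# Canned rankings standing in for two real retrieval strategies
KEYWORD_RESULTS = {
    "reset password": ["d1", "d9", "d2", "d8", "d7"],
    "refund policy": ["d6", "d3", "d9", "d8", "d7"],
    "export data": ["d9", "d8", "d7", "d6", "d1"],
}
DENSE_RESULTS = {
    "reset password": ["d2", "d1", "d9", "d8", "d7"],
    "refund policy": ["d3", "d6", "d9", "d8", "d7"],
    "export data": ["d9", "d4", "d8", "d7", "d6"],
}


def keyword_retriever(query: str, k: int) -> list[str]:
    return KEYWORD_RESULTS.get(query, [])[:k]


def dense_retriever(query: str, k: int) -> list[str]:
    return DENSE_RESULTS.get(query, [])[:k]


def evaluate(
    retriever: Retriever, test_set: list[dict], k: int = 5
) -> tuple[dict[str, float], list[str]]:
    """Steps 2 and 4: average metrics plus the queries with zero recall."""
    if not test_set:
        raise ValueError("test set is empty")
    rows = []
    for case in test_set:
        relevant = case["relevant"]
        retrieved = retriever(case["query"], k)[:k]
        hits = sum(1 for doc_id in retrieved if doc_id in relevant)
        first_hit = next(
            (rank for rank, d in enumerate(retrieved, start=1) if d in relevant), None
        )
        rows.append({
            "query": case["query"],
            "precision": hits / k,
            "recall": hits / len(relevant) if relevant else 0.0,
            "rr": 1.0 / first_hit if first_hit else 0.0,
        })
    n = len(rows)
    summary = {
        f"precision@{k}": sum(r["precision"] for r in rows) / n,
        f"recall@{k}": sum(r["recall"] for r in rows) / n,
        "mrr": sum(r["rr"] for r in rows) / n,
    }
    failed = [r["query"] for r in rows if r["recall"] == 0.0]
    return summary, failed


# Step 3: compare strategies on the same test set
for name, retriever in [("keyword", keyword_retriever), ("dense", dense_retriever)]:
    summary, failed = evaluate(retriever, TEST_SET)
    print(name, summary, "failed queries:", failed)
```

Keeping the per-query rows, rather than only the averages, is what makes it possible to inspect the individual queries where a strategy retrieves nothing relevant.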


35 min · Advanced

Build a Hallucination Detection Pipeline

Detect and score hallucinations automatically.

Implement LLM-as-judge for groundedness scoring
Run on 50 query-answer pairs
Build a Grafana dashboard for quality metrics
Set up alerts for quality threshold violations


Recap

Key takeaways

Evaluate retrieval and generation SEPARATELY - bad retrieval causes bad answers
Precision@K and Recall@K are the most important retrieval metrics
Groundedness scoring detects hallucinations by checking claims against context
LLM-as-judge is practical for automated evaluation at scale
Run evaluation on every change - treat quality as a CI/CD gate
