Module 5: Document Processing & Chunking

Chunking strategies, data cleaning, metadata enrichment, and building ingestion pipelines

3.5 hours. 2 hands-on labs. Free course module.

Learning Objectives

  • Design chunking strategies for different document types
  • Build robust document ingestion pipelines
  • Implement metadata enrichment for better retrieval
  • Handle PDFs, HTML, Markdown, and structured data

Why This Matters

The ingestion pipeline determines what your RAG system can find: bad chunking leads to bad retrieval, which leads to bad answers. Many production RAG failures trace back to poor document processing rather than to the model.

Diagram: Document ingestion pipeline. Raw Documents (PDF, HTML, MD) -> Parse + Clean (extract text) -> Chunk (200-500 tokens) -> Enrich Metadata (title, source, date) -> Embed (vectorize) -> Store (vector DB).
Chunk size determines retrieval quality. Too small = loses context. Too large = dilutes relevance. Sweet spot: 200-500 tokens with 10-20% overlap between chunks.
Architecture diagram for Module 5: Document Processing & Chunking.

Lesson Content

RAG quality depends on what you feed it. Garbage in, garbage out applies more to RAG than almost any other system. This module teaches you to build robust document ingestion pipelines that clean, chunk, enrich, and embed your data for optimal retrieval.

Chunking Strategies

Documents must be split into chunks before embedding, and chunk size strongly affects retrieval quality. Note that the fixed-size example below measures size in characters, not tokens:

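# Fixed-size chunking (simple, often sufficient)
# Note: chunk_size and overlap are measured in characters, not tokens.
def chunk_fixed(text, chunk_size=500, overlap=50):
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoids a redundant trailing chunk
        start = end - overlap  # step back so adjacent chunks share text
    return chunks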

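# Semantic chunking (split on natural boundaries)
import re

def chunk_semantic(text, min_chars=50):
    # Split at Markdown H2/H3 headings and at blank lines.
    sections = re.split(r'\n## |\n### |\n\n', text)
    # Fragments of min_chars characters or fewer are dropped, not merged.
    return [s.strip() for s in sections if len(s.strip()) > min_chars]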

# Recursive chunking (try large splits first, then smaller)
# Split on paragraphs -> sentences -> words
# Keep chunks under max_size while preserving meaning
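
The recursive strategy outlined in the comments above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function name chunk_recursive, the SEPARATORS list, and the character-based max_size are all assumptions made for the example.

```python
# Hypothetical separator order: paragraphs, then sentences, then words.
SEPARATORS = ["\n\n", ". ", " "]

def chunk_recursive(text, max_size=500, separators=SEPARATORS):
    """Split text so every chunk is at most max_size characters."""
    text = text.strip()
    if len(text) <= max_size:
        return [text] if text else []
    if not separators:
        # No boundary left to try: fall back to a hard character cut.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_size:
            current = candidate  # keep packing pieces into the same chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_size:
            current = piece
        else:
            # This piece alone is too large: retry with finer separators.
            chunks.extend(chunk_recursive(piece, max_size, finer))
    if current:
        chunks.append(current)
    return chunks
```

Short paragraphs stay whole, and only an oversized paragraph is broken down further. One known limitation of this sketch: the separator at a chunk boundary (for example the ". " between two sentences) is dropped rather than kept.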

Metadata Enrichment

Every chunk should carry metadata: source document title, section heading, page number, date, author, category. This metadata enables filtering at search time — "find relevant chunks from engineering docs written in 2026."
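
As a rough sketch of the idea, metadata can travel with each chunk as a plain dictionary and be checked at search time. The Chunk class and the enrich and filter_chunks helpers below are hypothetical names for illustration; a real vector database would typically apply such filters inside the query rather than in Python.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def enrich(chunks, **doc_metadata):
    """Attach document-level metadata plus a position index to each chunk."""
    return [
        Chunk(text=t, metadata={**doc_metadata, "chunk_index": i})
        for i, t in enumerate(chunks)
    ]

def filter_chunks(chunks, category=None, year=None):
    """Keep only chunks whose metadata matches the given filters."""
    kept = []
    for c in chunks:
        if category is not None and c.metadata.get("category") != category:
            continue
        # Chunks without a date fall back to date.min and fail any year filter.
        if year is not None and c.metadata.get("date", date.min).year != year:
            continue
        kept.append(c)
    return kept

# Example: "engineering docs written in 2026" (sample data is made up)
eng = enrich(["Deploy with blue/green releases."],
             title="Release Guide", category="engineering", date=date(2026, 3, 1))
hr = enrich(["Vacation policy."],
            title="HR Handbook", category="hr", date=date(2025, 1, 15))
hits = filter_chunks(eng + hr, category="engineering", year=2026)
```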

Handling Different Document Types

  • PDF: Use pdfplumber or PyMuPDF for text extraction. Handle tables and images separately.
  • HTML: Strip tags, preserve structure (headings become metadata).
  • Markdown: Split on headers for natural semantic boundaries.
  • Code: Chunk by function/class, include docstrings and signatures.
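
For the HTML case, one possible approach is to strip tags while remembering the most recent heading, so that each text block carries its heading as metadata. The sketch below uses only the standard library html.parser module; the class name HeadingAwareExtractor and the function parse_html are made up for this example, and only h1-h3, p, li, and div are treated as block boundaries.

```python
from html.parser import HTMLParser

class HeadingAwareExtractor(HTMLParser):
    """Collect text blocks, tagging each with the most recent heading."""
    HEADINGS = {"h1", "h2", "h3"}
    SKIP = {"script", "style"}      # content that is never user-visible text
    BLOCKS = {"p", "li", "div"}

    def __init__(self):
        super().__init__()
        self.blocks = []            # list of {"heading": ..., "text": ...}
        self._heading = None
        self._buffer = []
        self._skip_depth = 0

    def _normalized(self):
        # Join buffered fragments and collapse whitespace runs.
        return " ".join("".join(self._buffer).split())

    def _flush(self):
        text = self._normalized()
        if text:
            self.blocks.append({"heading": self._heading, "text": text})
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._flush()           # close any text that preceded the heading

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.HEADINGS:
            self._heading = self._normalized()
            self._buffer = []
        elif tag in self.BLOCKS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)

def parse_html(html):
    parser = HeadingAwareExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()                 # emit any trailing text
    return parser.blocks
```

Each returned block can then be chunked further if it is too long, with the "heading" value copied into the chunk metadata.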

Common Mistakes

  • Using one chunk size for all document types
  • Not adding metadata to chunks (cannot filter by source, date, category)
  • Chunking code the same way as prose (functions should be complete units)
  • Not handling PDF tables and images (silently lost content)

Key Terms

Chunking
Splitting documents into smaller pieces for embedding and retrieval
Overlap
Text shared between adjacent chunks so that context spanning a chunk boundary is less likely to be lost
Metadata
Structured data attached to chunks (title, source, date, category)

Hands-On Labs

  1. Build a Document Ingestion Pipeline

    Process multiple document types into chunked, embedded vectors.

    35 min - Intermediate

    • Parse PDFs, Markdown, and HTML documents
    • Implement fixed-size and semantic chunking
    • Enrich chunks with metadata (title, source, section)
    • Embed and store in Qdrant

    View lab files on GitHub

  2. Compare Chunking Strategies

    Measure the impact of chunk size on retrieval quality.

    30 min - Intermediate

    • Chunk the same corpus with 3 different strategies
    • Run identical queries against each
    • Measure precision and recall at different chunk sizes
    • Document the optimal strategy for your data

    View lab files on GitHub
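
For the measurement step in this lab, precision and recall at a cutoff k can be computed along these lines. This is a sketch only; the function names and the ID-based relevance labels are assumptions, not the lab's actual code. Because chunk IDs differ between chunking strategies, the relevance labels need to be defined in a way that is comparable across them (for example, by source passage).

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision and recall over the top-k retrieved IDs for one query."""
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def mean_metrics(results, k):
    """results: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    scores = [precision_recall_at_k(r, rel, k) for r, rel in results]
    if not scores:
        return 0.0, 0.0
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n
```

Running mean_metrics once per chunking strategy, with the same queries and the same k, gives numbers that can be compared side by side.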

Key Takeaways

  • Chunk size is one of the most impactful RAG parameters: too small loses context, too large dilutes relevance
  • Sweet spot: 200-500 tokens with 10-20% overlap
  • Metadata enrichment enables filtering, which is critical for multi-tenant systems and large corpora
  • Different document types need different parsing strategies
  • Build ingestion as a pipeline: parse -> clean -> chunk -> enrich -> embed -> store
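
The stages in the last takeaway can be wired together as in the sketch below. Everything here is illustrative: whitespace-separated words stand in for model tokens, toy_embed is a placeholder hashed bag of words rather than a real embedding model, and a plain Python list stands in for the vector DB. Parsing is assumed to have happened upstream.

```python
import hashlib
import re

def clean(text):
    """Collapse whitespace runs left over from extraction."""
    return re.sub(r"\s+", " ", text).strip()

def chunk_tokens(text, size=300, overlap_ratio=0.15):
    """Sliding window over whitespace tokens (a rough proxy for model tokens)."""
    tokens = text.split()
    step = max(1, size - round(size * overlap_ratio))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks

def toy_embed(text, dims=16):
    """Placeholder embedding: hashed bag of words. NOT a real model."""
    vec = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dims] += 1.0
    return vec

def ingest(raw_text, source, store):
    """clean -> chunk -> enrich -> embed -> store, for one parsed document."""
    text = clean(raw_text)
    for i, chunk in enumerate(chunk_tokens(text)):
        store.append({
            "vector": toy_embed(chunk),
            "text": chunk,
            "metadata": {"source": source, "chunk_index": i},
        })
    return store
```

With the defaults, each 300-token window shares 45 tokens (15%) with the next one, which sits inside the 10-20% overlap range given above.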