Module 5 of 16

Document Processing & Chunking

Chunking strategies, data cleaning, metadata enrichment, and building ingestion pipelines

3.5 hours · 2 labs · Free

Start here

Learning objectives

Design chunking strategies for different document types
Build robust document ingestion pipelines
Implement metadata enrichment for better retrieval
Handle PDFs, HTML, Markdown, and structured data

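RAG quality depends on what you feed it. "Garbage in, garbage out" applies with particular force here: the retriever can only return chunks that were ingested, so text that is badly extracted, split, or labeled at ingestion time cannot be repaired at query time. This module teaches you to build robust document ingestion pipelines that clean, chunk, enrich, and embed your data for retrieval.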

Chunking Strategies

Documents must be split into chunks before embedding. Chunk size strongly affects retrieval quality (too small loses context, too large dilutes relevance). The examples below measure size in characters:

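# Fixed-size chunking (simple, often sufficient)
# Note: chunk_size and overlap are measured in characters, not tokens.
def chunk_fixed(text, chunk_size=500, overlap=50):
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would stop the loop from ever advancing
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # final chunk reached; do not emit a tail that is pure overlap
        start = end - overlap  # step back so adjacent chunks share `overlap` characters
    return chunks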

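# Semantic chunking (split on natural boundaries)
import re

def chunk_semantic(text, min_chars=50):
    # Split before level-2/3 Markdown headings (the lookahead keeps the heading
    # marker with its section) and on blank lines between paragraphs.
    sections = re.split(r'\n(?=#{2,3} )|\n\n+', text)
    # Keep only sections longer than min_chars; note this also drops short but valid ones.
    return [s.strip() for s in sections if len(s.strip()) > min_chars]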

# Recursive chunking (try large splits first, then smaller)
# Split on paragraphs -> sentences -> words
# Keep chunks under max_size while preserving meaning
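
The recursive strategy outlined in the comments above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `chunk_recursive`, the separator list, and the character-based `max_size` are all assumptions made for the example.

```python
def chunk_recursive(text, max_size=500, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first; recurse into finer separators
    only for pieces that are still longer than max_size (in characters)."""
    text = text.strip()
    if len(text) <= max_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to a hard character cut.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece.strip():
            continue  # skip empty fragments produced by repeated separators
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_size:
            current = piece
        else:
            # The piece alone is too long: split it with the finer separators.
            chunks.extend(chunk_recursive(piece, max_size, finer))
    if current:
        chunks.append(current)
    return chunks


paragraphs = "\n\n".join(["word " * 30] * 3)   # three paragraphs of about 150 characters
chunks = chunk_recursive(paragraphs, max_size=200)
```

Pieces are greedily packed back together so that chunks stay close to `max_size` instead of ending up as one chunk per sentence. One limitation of this sketch is that the separator at a chunk boundary is dropped, so a sentence that ends a chunk loses its trailing period.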

Metadata Enrichment

Every chunk should carry metadata: source document title, section heading, page number, date, author, category. This metadata enables filtering at search time - "find relevant chunks from engineering docs written in 2026."
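
As a rough sketch of that idea, the snippet below attaches document-level metadata to each chunk and then filters on it. The `Chunk` class, the `enrich` and `filter_chunks` helpers, and the sample documents are hypothetical and exist only for illustration; in a real system the vector store would typically apply the filter itself.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def enrich(texts, **doc_metadata):
    """Attach document-level metadata plus a positional index to each chunk."""
    return [
        Chunk(text=t, metadata={**doc_metadata, "chunk_index": i})
        for i, t in enumerate(texts)
    ]


def filter_chunks(chunks, **criteria):
    """Keep only chunks whose metadata matches every key/value in criteria."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in criteria.items())
    ]


# Illustrative sample data (invented for this example).
engineering = enrich(
    ["Deploy with blue-green releases.", "Roll back on failed health checks."],
    title="Deployment Guide", category="engineering", year=2026,
)
hr = enrich(
    ["Submit leave requests two weeks ahead."],
    title="Leave Policy", category="hr", year=2025,
)

matches = filter_chunks(engineering + hr, category="engineering", year=2026)
```

Storing metadata as a flat dictionary keeps it easy to serialize alongside the vector, whichever store is used.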

Handling Different Document Types

PDF: Use pdfplumber or PyMuPDF for text extraction. Handle tables and images separately.
HTML: Strip tags, preserve structure (headings become metadata).
Markdown: Split on headers for natural semantic boundaries.
Code: Chunk by function/class, include docstrings and signatures.
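
To illustrate the HTML case, here is a minimal sketch that strips tags while keeping each heading as metadata for the text beneath it. It uses only the standard-library `html.parser`; the `SectionExtractor` class, the `html_to_sections` helper, and the choice of tags treated as headings or block elements are assumptions for this example.

```python
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collect visible text, grouped under the most recent h1-h3 heading."""
    HEADINGS = {"h1", "h2", "h3"}
    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "li", "br", "tr", "td", "th"}

    def __init__(self):
        super().__init__()
        self.sections = []          # list of {"heading": str, "text": str}
        self._heading = ""          # heading currently in effect
        self._heading_parts = []
        self._in_heading = False
        self._skip_depth = 0
        self._text_parts = []

    def _flush_section(self):
        text = " ".join("".join(self._text_parts).split())  # normalize whitespace
        if text:
            self.sections.append({"heading": self._heading, "text": text})
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._flush_section()   # a new heading closes the previous section
            self._heading_parts = []
            self._in_heading = True
        elif tag in self.BLOCK:
            self._text_parts.append(" ")  # keep adjacent blocks from fusing

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.HEADINGS:
            self._heading = " ".join("".join(self._heading_parts).split())
            self._in_heading = False
        elif tag in self.BLOCK:
            self._text_parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth:
            return                  # ignore script and style contents
        if self._in_heading:
            self._heading_parts.append(data)
        else:
            self._text_parts.append(data)


def html_to_sections(html):
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    parser._flush_section()         # emit the final section
    return parser.sections


page = ("<h1>Guide</h1><p>Intro text.</p>"
        "<h2>Setup <code>v2</code></h2><p>Run the <b>installer</b>.</p>"
        "<script>var x = 1;</script>")
sections = html_to_sections(page)
```

Each returned section can then be chunked further, with its `heading` copied into the chunk metadata.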

Common mistakes

What usually breaks

Using one chunk size for all document types
Not adding metadata to chunks (cannot filter by source, date, category)
Chunking code the same way as prose (functions should be complete units)
Not handling PDF tables and images (silently lost content)
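
For the code-chunking mistake in particular, one way to keep functions and classes intact is to split on syntax-tree boundaries instead of character counts. The sketch below does this for Python source with the standard-library `ast` module; the `chunk_python_source` helper and the shape of the returned dictionaries are assumptions for illustration, and other languages would need their own parser.

```python
import ast


def chunk_python_source(source):
    """Return one chunk per top-level function or class, with basic metadata."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "docstring": ast.get_docstring(node) or "",
                # Source text of the definition itself (decorators are not included).
                "text": ast.get_source_segment(source, node),
            })
    return chunks


sample = '''import os

def add(a, b):
    """Add two numbers."""
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
code_chunks = chunk_python_source(sample)
```

Module-level statements such as imports are skipped here; a fuller pipeline might collect them into a separate chunk so that context is not lost.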

Key terms

Vocabulary used in this module

Chunking

Splitting documents into smaller pieces for embedding and retrieval

Overlap

Text shared between adjacent chunks so that content cut at a chunk boundary still appears intact in at least one chunk

Metadata

Structured data attached to chunks (title, source, date, category)

Labs

Hands-on labs

35 min · Intermediate

Build a Document Ingestion Pipeline

Process multiple document types into chunked, embedded vectors.

Parse PDFs, Markdown, and HTML documents
Implement fixed-size and semantic chunking
Enrich chunks with metadata (title, source, section)
Embed and store in Qdrant

30 min · Intermediate

Compare Chunking Strategies

Measure the impact of chunk size on retrieval quality.

Chunk the same corpus with 3 different strategies
Run identical queries against each
Measure precision and recall at different chunk sizes
Document the optimal strategy for your data
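
The measurement step above could be sketched as precision and recall at a cutoff k. The function name `precision_recall_at_k` and the chunk identifiers are hypothetical; the lab's own evaluation code may differ.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Compute precision@k and recall@k for one query.

    retrieved_ids: ranked list of chunk ids returned by the retriever
    relevant_ids:  set of chunk ids judged relevant for the query
    """
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Illustrative ids only: the retriever returned four chunks, three are relevant overall.
retrieved = ["c3", "c7", "c1", "c9"]
relevant = {"c1", "c3", "c4"}
precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
```

Because chunk ids change from one chunking strategy to the next, a fair comparison probably needs relevance labels defined on something stable, such as the source document or a character span, and then mapped onto whichever chunks cover it.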

Recap

Key takeaways

Chunk size is one of the most impactful RAG parameters - too small loses context, too large dilutes relevance
Common starting point: 200-500 tokens with 10-20% overlap, then tune for your own corpus
Metadata enrichment enables filtering - critical for multi-tenant systems and large corpora
Different document types need different parsing strategies
Build ingestion as a pipeline: parse -> clean -> chunk -> enrich -> embed -> store
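
The pipeline shape in the last takeaway can be sketched as a chain of small functions. Everything here is a simplified stand-in: the stage functions are hypothetical, `embed` and `store` are injected callables in place of a real embedding model and vector database, and the lambda used below is a placeholder, not a real embedding.

```python
import re


def parse(raw):
    # Placeholder parser: the input is assumed to be plain text already.
    return raw


def clean(text):
    text = text.replace("\r\n", "\n")               # normalize line endings
    text = re.sub(r"[ \t]+", " ", text)             # collapse runs of spaces and tabs
    return re.sub(r"\n{3,}", "\n\n", text).strip()  # at most one blank line in a row


def chunk(text):
    # Paragraph split as a stand-in for any chunking strategy.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def enrich(pieces, source):
    return [
        {"text": p, "metadata": {"source": source, "chunk_index": i}}
        for i, p in enumerate(pieces)
    ]


def ingest(raw, source, embed, store):
    """Run parse -> clean -> chunk -> enrich -> embed -> store; return the chunk count."""
    records = enrich(chunk(clean(parse(raw))), source)
    for record in records:
        record["vector"] = embed(record["text"])
        store(record)
    return len(records)


stored = []
count = ingest(
    "Title\r\n\r\n\r\nFirst   para.\n\nSecond para.",
    source="demo.txt",
    embed=lambda text: [float(len(text))],  # placeholder, NOT a real embedding
    store=stored.append,
)
```

Passing `embed` and `store` in as arguments keeps each stage testable on its own and lets the model or database be swapped without touching the rest of the pipeline.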
