Module 5: Document Processing & Chunking
Chunking strategies, data cleaning, metadata enrichment, and building ingestion pipelines
3.5 hours. 2 hands-on labs. Free course module.
Learning Objectives
- Design chunking strategies for different document types
- Build robust document ingestion pipelines
- Implement metadata enrichment for better retrieval
- Handle PDFs, HTML, Markdown, and structured data
Why This Matters
The ingestion pipeline determines what your RAG system can find. Bad chunking leads to bad retrieval, which leads to bad answers. In practice, many production RAG failures trace back to poor document processing rather than to poor models.
Lesson Content
RAG quality depends on what you feed it. Garbage in, garbage out applies more to RAG than almost any other system. This module teaches you to build robust document ingestion pipelines that clean, chunk, enrich, and embed your data for optimal retrieval.
Chunking Strategies
Documents must be split into chunks before embedding. Chunk size dramatically affects retrieval quality:
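# Fixed-size chunking (simple, often sufficient)
# Sizes here are measured in characters, not tokens.
def chunk_fixed(text, chunk_size=500, overlap=50):
    if not 0 <= overlap < chunk_size:
        # An overlap >= chunk_size would stop `start` from advancing.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk emitted; avoids a trailing overlap-only chunk
        start = end - overlap
    return chunks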
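# Semantic chunking (split on natural boundaries)
import re

def chunk_semantic(text):
    # Split on Markdown H2/H3 headers and on blank lines.
    sections = re.split(r'\n## |\n### |\n\n', text)
    # Drop fragments of 50 characters or fewer (stray headings, short lines).
    return [s.strip() for s in sections if len(s.strip()) > 50]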
# Recursive chunking (try large splits first, then smaller)
# Split on paragraphs -> sentences -> words
# Keep chunks under max_size while preserving meaning
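The recursive strategy outlined in the comments above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `chunk_recursive`, the separator order, and the character-based `max_size` are all assumptions made for the example.

```python
def chunk_recursive(text, max_size=500, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first; recurse to finer separators
    only for pieces that are still longer than max_size (in characters)."""
    text = text.strip()
    if len(text) <= max_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every max_size characters.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_size:
            current = piece
        else:
            # The piece alone is too large: split it with the next finer separator.
            chunks.extend(chunk_recursive(piece, max_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

Separators are kept between pieces packed into the same chunk but dropped at chunk boundaries, so a sentence-ending period can be lost where two chunks meet. A production splitter would likely preserve it.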
Metadata Enrichment
Every chunk should carry metadata: source document title, section heading, page number, date, author, category. This metadata enables filtering at search time — "find relevant chunks from engineering docs written in 2026."
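A small sketch of what attaching and filtering on metadata might look like. The `Chunk` class and the `enrich` and `filter_chunks` helpers are hypothetical names introduced for illustration; a vector store would usually apply such a filter as part of the similarity search rather than in Python.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def enrich(chunks, **doc_metadata):
    """Attach document-level metadata plus a positional index to each chunk."""
    return [
        Chunk(text=c, metadata={**doc_metadata, "chunk_index": i})
        for i, c in enumerate(chunks)
    ]

def filter_chunks(chunks, **conditions):
    """Keep only chunks whose metadata matches every key=value condition."""
    return [
        c for c in chunks
        if all(c.metadata.get(k) == v for k, v in conditions.items())
    ]

# Illustrative data only.
corpus = enrich(["Deploy with blue-green releases.", "Rollback steps."],
                title="Release Runbook", category="engineering", year=2026)
corpus += enrich(["Expense policy overview."],
                 title="Finance Handbook", category="finance", year=2025)

# "Engineering docs written in 2026" expressed as a metadata filter.
engineering_2026 = filter_chunks(corpus, category="engineering", year=2026)
```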
Handling Different Document Types
- PDF: Use pdfplumber or PyMuPDF for text extraction. Handle tables and images separately.
- HTML: Strip tags, preserve structure (headings become metadata).
- Markdown: Split on headers for natural semantic boundaries.
- Code: Chunk by function/class, include docstrings and signatures.
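As one illustration of the HTML case, the sketch below strips tags with the standard-library `html.parser` and records the nearest preceding heading as metadata for each text block. The class name `HeadingAwareExtractor`, the helper `extract_blocks`, and the choice of which tags count as headings or blocks are assumptions for the example. Real pages often need a more forgiving parser.

```python
from html.parser import HTMLParser

class HeadingAwareExtractor(HTMLParser):
    """Collect visible text, tagging each block with the nearest preceding heading."""
    HEADINGS = {"h1", "h2", "h3"}
    BLOCKS = {"p", "li", "div"}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of {"section": ..., "text": ...}
        self.section = None    # text of the most recent heading
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._buffer = []

    def _flush(self):
        # Join buffered text and normalize whitespace.
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        return text

    def _emit_block(self):
        text = self._flush()
        if text:
            self.blocks.append({"section": self.section, "text": text})

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._emit_block()  # close any text that preceded the heading

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.HEADINGS:
            self.section = self._flush()  # heading text becomes metadata
        elif tag in self.BLOCKS:
            self._emit_block()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._emit_block()  # emit any trailing text

def extract_blocks(html):
    parser = HeadingAwareExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```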
Common Mistakes
- Using one chunk size for all document types
- Not adding metadata to chunks (cannot filter by source, date, category)
- Chunking code the same way as prose (functions should be complete units)
- Not handling PDF tables and images (silently lost content)
Key Terms
- Chunking: Splitting documents into smaller pieces for embedding and retrieval
- Overlap: Text shared between adjacent chunks, so that content cut at a chunk boundary still appears intact in at least one chunk
- Metadata: Structured data attached to chunks (title, source, date, category)
Hands-On Labs
- Build a Document Ingestion Pipeline
Process multiple document types into chunked, embedded vectors.
35 min - Intermediate
- Parse PDFs, Markdown, and HTML documents
- Implement fixed-size and semantic chunking
- Enrich chunks with metadata (title, source, section)
- Embed and store in Qdrant
- Compare Chunking Strategies
Measure the impact of chunk size on retrieval quality.
30 min - Intermediate
- Chunk the same corpus with 3 different strategies
- Run identical queries against each
- Measure precision and recall at different chunk sizes
- Document the optimal strategy for your data
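For the measurement step in the second lab, precision and recall at a cutoff k could be computed along these lines. This is a sketch under assumptions: the function name `precision_recall_at_k` is hypothetical, retrieved IDs are assumed unique, and relevance is assumed to be judged against identifiers that stay stable across chunking strategies (for example source passages), since chunk IDs themselves change when the chunking changes.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision@k and recall@k for one query.

    retrieved_ids: ranked list of IDs returned by the retriever (best first).
    relevant_ids:  IDs judged relevant for the query.
    """
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for item_id in top_k if item_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative IDs only.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = ["d1", "d2", "d5"]
p_at_4, r_at_4 = precision_recall_at_k(retrieved, relevant, k=4)
```

Averaging these values over the same query set for each chunking strategy gives a like-for-like comparison.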
Key Takeaways
- Chunk size is among the most impactful RAG parameters: chunks that are too small lose context, chunks that are too large dilute relevance
- A common starting point is 200-500 tokens with 10-20% overlap; tune it against your own corpus
- Metadata enrichment enables filtering, which is critical for multi-tenant systems and large corpora
- Different document types need different parsing strategies
- Build ingestion as a pipeline: parse -> clean -> chunk -> enrich -> embed -> store
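The pipeline stages in the last takeaway might be wired together as below. Everything here is a placeholder chosen so the sketch runs on its own: the stage functions are hypothetical, `embed` returns a hash-derived stand-in rather than a real embedding, and the store is a plain list instead of a vector database.

```python
import hashlib
import re

def parse(doc):
    # Placeholder parser: assumes the document body is already plain text.
    return doc["body"]

def clean(text):
    # Collapse runs of spaces/tabs and squeeze three or more newlines to two.
    text = re.sub(r"[ \t]+", " ", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()

def chunk(text):
    # Paragraph-level split; a stand-in for any chunking strategy.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def enrich(chunks, doc):
    return [{"text": c, "source": doc["source"], "chunk_index": i}
            for i, c in enumerate(chunks)]

def embed(record):
    # NOT a real embedding: a deterministic stand-in derived from a hash,
    # so the sketch runs without a model. Swap in a real embedding call.
    digest = hashlib.sha256(record["text"].encode("utf-8")).digest()
    return {**record, "vector": [b / 255 for b in digest[:8]]}

def ingest(docs, store):
    # parse -> clean -> chunk -> enrich -> embed -> store
    for doc in docs:
        text = clean(parse(doc))
        for record in enrich(chunk(text), doc):
            store.append(embed(record))
    return store

store = ingest(
    [{"source": "handbook.md", "body": "Intro   paragraph.\n\n\n\nSecond paragraph."}],
    store=[],
)
```

Keeping each stage a separate function makes it possible to swap one strategy (for example the chunker) and re-run the rest unchanged.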