RAG quality depends on what you feed it. "Garbage in, garbage out" applies with particular force to RAG, because the model can only ground its answers in the chunks that retrieval returns. This module teaches you to build robust document ingestion pipelines that clean, chunk, enrich, and embed your data for optimal retrieval.
Chunking Strategies
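Documents must be split into chunks before embedding. Chunk size strongly affects retrieval quality: chunks that are too small lose their surrounding context, while chunks that are too large dilute the embedding with unrelated content. Common strategies: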
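# Fixed-size chunking (simple, often sufficient)
def chunk_fixed(text, chunk_size=500, overlap=50):
    # overlap must be smaller than chunk_size or the loop never advances
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoid a redundant trailing fragment
        start = end - overlap
    return chunks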
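# Semantic chunking (split on natural boundaries)
import re

def chunk_semantic(text):
    # Split on level-2/3 Markdown headers and on blank lines
    sections = re.split(r'\n## |\n### |\n\n', text)
    # Drop fragments of 50 characters or fewer (stray headings, short lines)
    return [s.strip() for s in sections if len(s.strip()) > 50]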
# Recursive chunking (try large splits first, then smaller)
# Split on paragraphs -> sentences -> words
# Keep chunks under max_size while preserving meaning
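The recursive strategy outlined in the comments above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the name chunk_recursive, the SEPARATORS list, and the default max_size of 500 characters are assumptions made for the example, and pieces are rejoined with a single space, so original paragraph breaks inside a chunk are not preserved.

```python
import re

# Separators tried in order: paragraphs, then sentences, then words.
# (Hypothetical choices for this sketch; tune them for your documents.)
SEPARATORS = [r"\n\n+", r"(?<=[.!?])\s+", r"\s+"]


def chunk_recursive(text, max_size=500, level=0):
    """Split text into chunks of at most max_size characters."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_size:
        return [text]
    if level >= len(SEPARATORS):
        # No separators left (e.g. one very long word): fall back to a hard cut.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    chunks, current = [], ""
    for piece in re.split(SEPARATORS[level], text):
        if not piece:
            continue
        if len(piece) > max_size:
            # Piece is too big on its own: flush, then split it more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(chunk_recursive(piece, max_size, level + 1))
        elif current and len(current) + 1 + len(piece) > max_size:
            # Adding this piece would overflow the chunk: start a new one.
            chunks.append(current)
            current = piece
        else:
            # Greedily merge small pieces into the current chunk.
            current = f"{current} {piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks
```

Calling chunk_recursive(document_text, max_size=500) returns a list of strings, each no longer than max_size. Large splits are kept whole whenever they fit, and only oversized pieces are broken down at the next, finer level.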
Metadata Enrichment
Every chunk should carry metadata: source document title, section heading, page number, date, author, and category. This metadata enables filtering at search time, for example "find relevant chunks from engineering docs written in 2026."
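One way to picture metadata-based filtering is the small sketch below. The Chunk class, the enrich and filter_chunks helpers, and the field names (title, category, year) are all hypothetical and chosen only for illustration. In a real pipeline the filter would normally be handled by the vector store rather than by a Python list comprehension.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def enrich(chunks, **metadata):
    """Attach the same document-level metadata to every chunk, plus its position."""
    return [
        Chunk(text=t, metadata={**metadata, "chunk_index": i})
        for i, t in enumerate(chunks)
    ]


def filter_chunks(chunks, **criteria):
    """Keep chunks whose metadata matches every key/value pair in criteria."""
    return [
        c for c in chunks
        if all(c.metadata.get(k) == v for k, v in criteria.items())
    ]


# Example data (invented for this sketch).
corpus = enrich(["Deploy with blue-green releases."],
                title="Deploy Guide", category="engineering", year=2026)
corpus += enrich(["Expense reports are due monthly."],
                 title="Finance FAQ", category="finance", year=2025)

# Equivalent of "engineering docs written in 2026".
hits = filter_chunks(corpus, category="engineering", year=2026)
```

Here hits contains only the chunk from the engineering document. Storing metadata next to the text, rather than inside it, keeps the embedded content clean while still allowing the search to be narrowed.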
Handling Different Document Types
- PDF: Use pdfplumber or PyMuPDF for text extraction. Handle tables and images separately.
- HTML: Strip tags, preserve structure (headings become metadata).
- Markdown: Split on headers for natural semantic boundaries.
- Code: Chunk by function/class, include docstrings and signatures.
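As an illustration of the last item, chunking code by function or class, here is a minimal sketch for Python source using the standard library ast module. The function name chunk_python_source and the shape of the returned dictionaries are assumptions for this example. It only covers top-level definitions in Python files, and other languages would need their own parser.

```python
import ast


def chunk_python_source(source):
    """Return one chunk per top-level function or class in a Python module."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,                       # usable as metadata
                "kind": type(node).__name__,             # FunctionDef, ClassDef, ...
                "docstring": ast.get_docstring(node),    # None if there is no docstring
                "text": ast.get_source_segment(source, node),  # signature + body
            })
    return chunks
```

Each chunk keeps the signature, docstring, and body together, so a retrieved chunk is readable on its own. Module-level statements such as imports are skipped in this sketch and would need separate handling if they matter for retrieval.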