Large Language Models generate text by predicting the next token. They are remarkably capable but fundamentally limited: they can only draw from their training data, which is static, potentially outdated, and lacks your domain-specific knowledge.
Retrieval-Augmented Generation (RAG) addresses this by retrieving relevant documents at query time and injecting them into the prompt. The model reads your data before answering, so its response is grounded in the supplied documents rather than in learned patterns alone.
How LLMs Work (The 5-Minute Version)
An LLM is a neural network trained on massive text corpora. Given a sequence of tokens, it assigns a probability to every possible next token and picks one, usually among the most likely. It does not "know" things; it has learned statistical patterns of how words relate. This is why it can write fluent text but also confidently state falsehoods.
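To make next-token prediction concrete, here is a toy sketch that counts which word follows which in a tiny made-up corpus and predicts the most frequent follower. It is not how a real LLM is built (those use neural networks over subword tokens); the function names and sample text are illustrative assumptions.

```python
from collections import Counter, defaultdict


def train_bigram(corpus: str) -> dict:
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = corpus.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts: dict, word: str):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up training text, for illustration only.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)

print(predict_next(model, "the"))  # "cat" follows "the" twice, more than any other word
print(predict_next(model, "dog"))  # None: "dog" never appeared in the training text
```

The toy model picks "cat" because that pattern is the most frequent one in its data, not because it is true. That is the same mechanism, at a much smaller scale, behind fluent but unverified output.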
Tokens and Context Windows
Text is split into tokens (roughly 3-4 characters each for English text). Each model has a context window: the maximum number of tokens it can process in one request. GPT-4o has a 128K-token window and Claude models offer 200K, though limits vary by model version. Everything you send (system prompt, conversation history, retrieved documents, your question) must fit within this window, along with the model's response.
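A rough budget check shows how the pieces of a request compete for the same window. The sketch below assumes about 4 characters per token, which is only a rule of thumb; a real system would count tokens with the model's own tokenizer. The function names, the reserve for the answer, and the sample sizes are illustrative assumptions.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude estimate: character count divided by an assumed average token length."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(parts: dict, context_window: int, reserve_for_answer: int = 1_000):
    """Check whether all prompt parts plus room for the answer fit in the window."""
    used = {name: estimate_tokens(text) for name, text in parts.items()}
    total = sum(used.values())
    fits = total + reserve_for_answer <= context_window
    return fits, total, used


# Placeholder strings standing in for real prompt content.
parts = {
    "system_prompt": "x" * 400,
    "history": "x" * 4_000,
    "retrieved_documents": "x" * 40_000,
    "question": "x" * 80,
}

fits, total, used = fits_context(parts, context_window=128_000)
print(fits, total)  # True 11120
print(used)         # per-part estimates, useful for deciding what to trim
```

With these placeholder sizes the retrieved documents dominate the budget. In practice that is usually the part to trim first, by retrieving fewer or shorter chunks.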
Why LLMs Hallucinate
- Knowledge cutoff: Training data is frozen at a point in time
- No access to private data: The model does not know about your company docs
- Statistical generation: It generates plausible-sounding text, not verified facts
- No source grounding: Without retrieval, there is nothing to cite
What RAG Changes
RAG adds a retrieval step before generation. When a user asks a question, the system searches a vector database for relevant document chunks and injects them into the prompt as context; the model then generates an answer grounded in those documents. The result: more accurate, citable, domain-specific answers, with fewer hallucinated guesses.
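The retrieve-then-inject flow can be sketched in a few lines. This is a minimal sketch under loud assumptions: a bag-of-words vector stands in for a real embedding model, an in-memory list stands in for the vector database, and the final call to an LLM is left out. All names and sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list) -> str:
    """Inject the retrieved chunks into the prompt as numbered, citable context."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, start=1))
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )


# Made-up document chunks, for illustration only.
chunks = [
    "Refunds are issued within 14 days of a return request.",
    "The office is closed on public holidays.",
    "Support is available by email on weekdays.",
]
question = "How many days do refunds take?"

top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)  # this string is what would be sent to the model
```

Numbering the chunks in the prompt is one simple way to let the model cite its sources. Word-overlap retrieval like this misses paraphrases, which is the gap real embedding models are meant to close.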
Types of RAG Systems
- Naive RAG: Simple retrieve-and-generate. Adequate for demos, fragile in production.
- Advanced RAG: Hybrid search, reranking, query transformation. Production-viable.
- Agentic RAG: AI agents that decide what to retrieve, when, and how. Multi-step reasoning.
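As one illustration of what hybrid search can involve, the sketch below merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF). RRF is only one common way to combine result lists, not the required one. The document IDs are made up, and k=60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked lists; each document scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical results from two separate retrievers for the same query.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers rank well (doc_a, doc_c) rise to the top, while documents found by only one retriever still remain in the list. A reranking step would typically run on this fused list.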