Module 1: Introduction to AI & RAG Systems
LLM fundamentals, hallucinations, and why retrieval-augmented generation changes everything
3 hours. 2 hands-on labs. Free course module.
Learning Objectives
- Understand how LLMs work at a high level
- Learn about tokens, context windows, and their limitations
- Understand why LLMs hallucinate and how RAG solves it
- Compare vanilla LLM vs RAG responses
Why This Matters
Every AI application that needs to answer questions about specific data — company docs, product manuals, legal contracts, medical records — needs RAG. Without it, your chatbot confidently makes things up. With it, your chatbot cites real sources. This is the foundation of every production AI system.
Lesson Content
Large Language Models generate text by predicting the next token. They are remarkably capable but fundamentally limited: they can only draw from their training data, which is static, potentially outdated, and lacks your domain-specific knowledge.
Retrieval-Augmented Generation (RAG) solves this by retrieving relevant documents at query time and injecting them into the prompt. The model reads your data before answering — grounding its response in facts rather than patterns.
How LLMs Work (The 5-Minute Version)
An LLM is a neural network trained on massive text corpora. Given a sequence of tokens, it predicts the most likely next token. It does not "know" things — it has learned statistical patterns of how words relate. This is why it can write fluent text but also confidently state falsehoods.
Tokens and Context Windows
Text is split into tokens (roughly 3-4 characters each). Each model has a context window — the maximum tokens it can process in one request. GPT-4o has 128K tokens. Claude has 200K. Everything you send (system prompt, conversation history, retrieved documents, your question) must fit within this window.
Why LLMs Hallucinate
- Knowledge cutoff: Training data is frozen at a point in time
- No access to private data: The model does not know about your company docs
- Statistical generation: It generates plausible-sounding text, not verified facts
- No source grounding: Without retrieval, there is nothing to cite
What RAG Changes
RAG adds a retrieval step before generation. When a user asks a question, the system searches a vector database for relevant document chunks, injects them into the prompt as context, and the model generates an answer grounded in those documents. The result: accurate, citable, domain-specific answers instead of hallucinated guesses.
Types of RAG Systems
- Naive RAG: Simple retrieve-and-generate. Adequate for demos, fragile in production.
- Advanced RAG: Hybrid search, reranking, query transformation. Production-viable.
- Agentic RAG: AI agents that decide what to retrieve, when, and how. Multi-step reasoning.
Real-World Use Cases
- Customer support bots answering from product documentation
- Legal AI searching case law and contracts
- Medical AI retrieving clinical guidelines
- Code assistants searching internal repositories
- Enterprise search across thousands of documents
Common Mistakes
- Building a chatbot without RAG and hoping the LLM knows your domain
- Stuffing the entire document into the prompt instead of retrieving relevant chunks
- Ignoring context window limits — overfilling the prompt degrades quality
- Not evaluating retrieval quality — bad retrieval means bad answers regardless of the model
Career Relevance
RAG engineering is the most in-demand AI skill after prompt engineering. Every company building AI products needs engineers who can architect retrieval systems, not just chain API calls.
Key Terms
- LLM
- Large Language Model — neural network trained on text to predict tokens
- RAG
- Retrieval-Augmented Generation — retrieve relevant docs before generating
- Hallucination
- When an LLM generates plausible but factually incorrect information
- Context Window
- Maximum tokens an LLM can process in one request
- Token
- Smallest unit of text processed by an LLM (~3-4 characters)
Hands-On Labs
-
Run Your First LLM Application
Build a simple Python app that calls an LLM API.
20 min - Beginner
- Install the Anthropic Python SDK
- Send a basic prompt to Claude
- Observe the response and token usage
- Ask a question about recent events and observe hallucination
-
Compare Vanilla LLM vs RAG
See the difference RAG makes on answer quality.
25 min - Beginner
- Ask the LLM a domain-specific question (without context)
- Provide the same question with relevant document context
- Compare accuracy, citations, and confidence
- Discuss when RAG is necessary vs when vanilla LLM suffices
Key Takeaways
- LLMs predict tokens based on training data — they do not know facts
- Hallucinations happen because the model generates plausible text without verification
- RAG retrieves relevant documents and injects them into the prompt before generation
- Context windows limit how much data you can include — retrieval selects the most relevant
- Three RAG levels: naive (demo), advanced (production), agentic (autonomous)