Module 1 of 16

Introduction to AI & RAG Systems

LLM fundamentals, hallucinations, and why retrieval-augmented generation changes everything

3 hours2 labsFree

Watch as Slides Course overview Lab code

Start here

Learning objectives

Understand how LLMs work at a high level
Learn about tokens, context windows, and their limitations
Understand why LLMs hallucinate and how RAG solves it
Compare vanilla LLM vs RAG responses

Large Language Models generate text by predicting the next token. They are remarkably capable but fundamentally limited: they can only draw from their training data, which is static, potentially outdated, and lacks your domain-specific knowledge.

Retrieval-Augmented Generation (RAG) solves this by retrieving relevant documents at query time and injecting them into the prompt. The model reads your data before answering - grounding its response in facts rather than patterns.

How LLMs Work (The 5-Minute Version)

An LLM is a neural network trained on massive text corpora. Given a sequence of tokens, it predicts the most likely next token. It does not "know" things - it has learned statistical patterns of how words relate. This is why it can write fluent text but also confidently state falsehoods.

Tokens and Context Windows

Text is split into tokens (roughly 3-4 characters each). Each model has a context window - the maximum tokens it can process in one request. GPT-4o has 128K tokens. Claude has 200K. Everything you send (system prompt, conversation history, retrieved documents, your question) must fit within this window.

Why LLMs Hallucinate

Knowledge cutoff: Training data is frozen at a point in time
No access to private data: The model does not know about your company docs
Statistical generation: It generates plausible-sounding text, not verified facts
No source grounding: Without retrieval, there is nothing to cite

What RAG Changes

RAG adds a retrieval step before generation. When a user asks a question, the system searches a vector database for relevant document chunks, injects them into the prompt as context, and the model generates an answer grounded in those documents. The result: accurate, citable, domain-specific answers instead of hallucinated guesses.

Types of RAG Systems

Naive RAG: Simple retrieve-and-generate. Adequate for demos, fragile in production.
Advanced RAG: Hybrid search, reranking, query transformation. Production-viable.
Agentic RAG: AI agents that decide what to retrieve, when, and how. Multi-step reasoning.

Real world

Where this shows up

Customer support bots answering from product documentation
Legal AI searching case law and contracts
Medical AI retrieving clinical guidelines
Code assistants searching internal repositories
Enterprise search across thousands of documents

Common mistakes

What usually breaks

Building a chatbot without RAG and hoping the LLM knows your domain
Stuffing the entire document into the prompt instead of retrieving relevant chunks
Ignoring context window limits - overfilling the prompt degrades quality
Not evaluating retrieval quality - bad retrieval means bad answers regardless of the model

Key terms

Vocabulary used in this module

LLM

Large Language Model - neural network trained on text to predict tokens

RAG

Retrieval-Augmented Generation - retrieve relevant docs before generating

Hallucination

When an LLM generates plausible but factually incorrect information

Context Window

Maximum tokens an LLM can process in one request

Token

Smallest unit of text processed by an LLM (~3-4 characters)

Labs

Hands-on labs

20 minBeginner

Run Your First LLM Application

Build a simple Python app that calls an LLM API.

Install the Anthropic Python SDK
Send a basic prompt to Claude
Observe the response and token usage
Ask a question about recent events and observe hallucination

View lab on GitHub

25 minBeginner

Compare Vanilla LLM vs RAG

See the difference RAG makes on answer quality.

Ask the LLM a domain-specific question (without context)
Provide the same question with relevant document context
Compare accuracy, citations, and confidence
Discuss when RAG is necessary vs when vanilla LLM suffices

View lab on GitHub

Recap

Key takeaways

LLMs predict tokens based on training data - they do not know facts
Hallucinations happen because the model generates plausible text without verification
RAG retrieves relevant documents and injects them into the prompt before generation
Context windows limit how much data you can include - retrieval selects the most relevant
Three RAG levels: naive (demo), advanced (production), agentic (autonomous)

Related resources