Module 13: Deploying RAG Systems
Dockerizing AI systems, Kubernetes for AI, GPU infrastructure, and CI/CD for AI applications
3.5 hours. 2 hands-on labs. Free course module.
Learning Objectives
- Containerize RAG systems with Docker
- Deploy on Kubernetes with proper resource management
- Configure GPU inference for embedding models
- Build CI/CD pipelines for AI applications
Why This Matters
Deploying RAG to production is where most projects stall. The gap between a localhost demo and a production Kubernetes deployment is wide, and this module covers the deployment patterns that bridge it.
Lesson Content
Deploying RAG is different from deploying a typical web app. You have CPU-bound services (API), GPU-bound services (embeddings), stateful services (vector DB), and external API dependencies (LLM). Each needs different scaling and resource strategies.
Dockerizing RAG
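# Multi-stage Dockerfile for RAG API

# Stage 1: install dependencies in a throwaway build stage
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime image with only the installed packages and the app code
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY . .
EXPOSE 8000
# Run uvicorn as a module: only site-packages is copied from the builder,
# so the uvicorn console script in /usr/local/bin is not present in this stage
CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]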
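The CMD above expects a module named main that exposes an ASGI callable named app. The module itself is not shown in this lesson, so the following is only a minimal stand-in written against the raw ASGI interface, with no framework required. The /healthz route, its response body, and the call helper are assumptions made for illustration; a real RAG API would more likely be built with a web framework and would add its query endpoints.

```python
# main.py: minimal ASGI app matching the Dockerfile's "main:app" target.
# The /healthz route and the response shape are illustrative assumptions.
import asyncio
import json


async def app(scope, receive, send):
    """Answer GET /healthz with 200 and any other HTTP request with 404."""
    if scope["type"] != "http":
        # Non-HTTP scopes (lifespan, websocket) are ignored in this sketch.
        return

    if scope["method"] == "GET" and scope["path"] == "/healthz":
        status, payload = 200, {"status": "ok"}
    else:
        status, payload = 404, {"error": "not found"}

    body = json.dumps(payload).encode("utf-8")
    await send({
        "type": "http.response.start",
        "status": status,
        "headers": [
            (b"content-type", b"application/json"),
            (b"content-length", str(len(body)).encode("ascii")),
        ],
    })
    await send({"type": "http.response.body", "body": body})


def call(method, path):
    """Drive the app in-process (no server) and return (status, JSON body)."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": method, "path": path, "headers": []}
    asyncio.run(app(scope, receive, send))
    return sent[0]["status"], json.loads(sent[1]["body"])
```

A health route like this gives Kubernetes liveness and readiness probes something cheap to call, and the in-process helper lets a CI job check the app without starting a container.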
Kubernetes for AI
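RAG API pods scale horizontally with an HPA. CPU utilization works as a starting signal, and request rate can replace it once custom metrics are exposed. Embedding service pods may need GPU nodes, or they can run on CPU with request batching. Qdrant runs as a StatefulSet with persistent volumes so its data survives pod restarts. Redis runs as a Deployment for caching.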
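To make the scaling behavior concrete, the sketch below reproduces the core replica calculation that the Horizontal Pod Autoscaler documentation describes: the desired count is the current count multiplied by the ratio of the observed metric to the target, rounded up, with no change when the ratio is within a tolerance of 1.0. It is a simplification that ignores pod readiness, stabilization windows, and multiple metrics. The function name, the default bounds, and the sample numbers are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA calculation: ceil(current * observed / target), clamped."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Never go below or above the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# RAG API: 3 pods averaging 90% CPU against a 60% target.
print(desired_replicas(3, 90, 60))    # 5
# RAG API: 62% against 60% is inside the 10% tolerance, so nothing changes.
print(desired_replicas(4, 62, 60))    # 4
# Embedding service: 250 queued items per pod against a target of 100.
print(desired_replicas(2, 250, 100))  # 5
```

The same formula covers both signals mentioned in this module: CPU utilization for the API and a per-pod queue depth for the embedding service. Only the metric fed into it changes.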
CI/CD for AI
AI CI/CD adds evaluation gates: run retrieval quality tests, hallucination detection, and latency benchmarks before deploying. A model or chunking change that degrades quality should be blocked automatically.
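One way to block a degrading change is to compare the candidate build's evaluation metrics against a stored baseline and fail when any metric moves too far in the wrong direction. The sketch below shows that comparison only. The metric names, the allowed regressions, and all sample values are hypothetical, and producing the metrics (running the retrieval tests, the hallucination checks, and the latency benchmark) is assumed to happen in an earlier pipeline step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    baseline: float          # value from the last accepted build
    candidate: float         # value from the build under test
    higher_is_better: bool
    max_regression: float    # allowed absolute move in the wrong direction


def regressions(metrics):
    """Return the names of metrics that regressed beyond their allowance."""
    failed = []
    for m in metrics:
        delta = m.candidate - m.baseline
        # Express the change as "how much worse", whatever the direction.
        worse_by = -delta if m.higher_is_better else delta
        if worse_by > m.max_regression:
            failed.append(m.name)
    return failed


# Hypothetical numbers, for illustration only.
SAMPLE = [
    Metric("retrieval_precision", 0.86, 0.81, True, 0.02),
    Metric("hallucination_rate", 0.04, 0.05, False, 0.02),
    Metric("p95_latency_ms", 820.0, 900.0, False, 100.0),
]

print(regressions(SAMPLE))  # ['retrieval_precision']
```

A CI job would fail when the returned list is non-empty. Comparing against a baseline catches gradual drift that a fixed threshold alone can miss.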
Key Terms
- HPA: Horizontal Pod Autoscaler; scales pods based on metrics
- StatefulSet: Kubernetes resource for stateful applications with persistent storage
- Quality Gate: CI/CD check that blocks deployment if quality metrics degrade
Hands-On Labs
- Deploy RAG on Kubernetes
  Deploy the full RAG stack on a Kind cluster.
  40 min - Advanced
  - Build Docker images for RAG API and embedding service
  - Deploy Qdrant StatefulSet and Redis on Kubernetes
  - Deploy RAG API with HPA
  - Test end-to-end with port-forward
- CI/CD with Quality Gates
  Build a pipeline that blocks deploys on quality regression.
  30 min - Advanced
  - Add retrieval quality tests to CI
  - Run hallucination detection on a test set
  - Set quality thresholds (block if precision < 0.8)
  - Deploy only if all quality gates pass
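The threshold step in the second lab can be sketched as a small script that computes mean precision@k over an evaluation set and returns an exit code. This is a rough outline under stated assumptions, not the lab's reference solution: the case format (retrieved IDs in rank order plus a set of relevant IDs), the document IDs, and the choice of k are all hypothetical. Only the 0.8 threshold comes from the lab description.

```python
PRECISION_THRESHOLD = 0.8  # from the lab: block if precision < 0.8


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def mean_precision(cases, k):
    """Average precision@k across all evaluation cases."""
    scores = [precision_at_k(c["retrieved"], c["relevant"], k) for c in cases]
    return sum(scores) / len(scores)


def gate(cases, k=5, threshold=PRECISION_THRESHOLD):
    """Return an exit code: 0 lets the pipeline continue, 1 blocks the deploy."""
    score = mean_precision(cases, k)
    print(f"mean precision@{k} = {score:.3f} (threshold {threshold})")
    return 0 if score >= threshold else 1


# Hypothetical evaluation set: retrieved IDs in rank order, plus ground truth.
sample_cases = [
    {"retrieved": ["d1", "d2", "d9"], "relevant": {"d1", "d2"}},
    {"retrieved": ["d7", "d3"], "relevant": {"d3"}},
]

exit_code = gate(sample_cases, k=2)  # mean of 1.0 and 0.5 is 0.75, so this is 1
```

In a pipeline, the script would pass the result to sys.exit so that a non-zero code fails the job and the deploy stage never runs.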
Key Takeaways
- RAG has mixed workloads: CPU (API), GPU (embeddings), stateful (vector DB)
- Scale RAG API with HPA on request rate, embedding service on queue depth
- Qdrant needs persistent storage — use StatefulSet, not Deployment
- AI CI/CD must include quality gates — not just tests, but evaluation metrics
- Multi-stage Docker builds keep production images small and secure