
Module 13: Deploying RAG Systems

Dockerizing AI systems, Kubernetes for AI, GPU infrastructure, and CI/CD for AI applications

3.5 hours. 2 hands-on labs. Free course module.

Learning Objectives

  • Containerize RAG systems with Docker
  • Deploy on Kubernetes with proper resource management
  • Configure GPU inference for embedding models
  • Build CI/CD pipelines for AI applications

Why This Matters

Deploying RAG to production is where most projects stall. The gap between a localhost demo and a production deployment on Kubernetes is large: the services have different resource needs, scaling signals, and failure modes. This module gives you the deployment patterns that bridge that gap.

RAG Deployment Architecture

  • Ingress
  • RAG API (FastAPI): CPU pods, HPA (scale on requests)
  • Embedding Svc: GPU pods (scale on queue depth)
  • Qdrant: StatefulSet (persistent)
  • Redis: cache for embeddings + answers
  • LLM API: external (no self-hosting needed)

Architecture diagram for Module 13: Deploying RAG Systems.
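
The architecture places Redis in front of the pipeline as a cache for embeddings and answers. A minimal cache-aside sketch of that idea follows. It uses a plain dict as a stand-in for Redis so it runs without a server, and the names `cache_key` and `get_or_compute` are hypothetical helpers, not part of the course code. Collapsing whitespace before hashing is also an assumption; drop it if your embedding model is sensitive to whitespace.

```python
import hashlib


def cache_key(kind, model, text):
    """Build a deterministic key such as 'emb:<model>:<sha256 of text>'.

    Including the model name keeps embeddings from different models apart.
    """
    normalized = " ".join(text.split())  # assumption: whitespace is not significant
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{kind}:{model}:{digest}"


def get_or_compute(cache, key, compute):
    """Cache-aside lookup: return the cached value, or compute and store it.

    `cache` is any dict-like object; a real deployment would wrap a Redis
    client and serialize values before storing them.
    """
    if key in cache:
        return cache[key]
    value = compute()
    cache[key] = value
    return value
```

With this shape, the second request for the same text and model skips the embedding call entirely, which is the main saving when the embedding service runs on GPU pods.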

Lesson Content

Deploying RAG is different from deploying a typical web app. You have CPU-bound services (API), GPU-bound services (embeddings), stateful services (vector DB), and external API dependencies (LLM). Each needs different scaling and resource strategies.

Dockerizing RAG

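# Multi-stage Dockerfile for RAG API

# Stage 1: install dependencies into an isolated prefix
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: runtime image with only the installed packages and app code
FROM python:3.12-slim
WORKDIR /app
# /install holds site-packages and console scripts (including uvicorn,
# assuming it is listed in requirements.txt)
COPY --from=builder /install /usr/local
COPY . .
# Run as a non-root user
RUN useradd --create-home appuser
USER appuser
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]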

Kubernetes for AI

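RAG API pods run on CPU nodes and scale horizontally with an HPA. Request rate is the preferred scaling signal, though it requires a custom metric; CPU utilization is the simpler built-in alternative. Embedding service pods may need GPU nodes (or can run on CPU with batching) and scale on queue depth. Qdrant runs as a StatefulSet with persistent volumes. Redis runs as a Deployment for caching.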
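
To make the HPA behavior concrete, here is a small sketch of the scaling rule that the Kubernetes documentation describes: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The function name and the example numbers are illustrative assumptions, and the sketch ignores details the real controller handles, such as pod readiness, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric.

    current_metric and target_metric use the same unit, for example
    requests per second per pod, or queued embedding jobs per pod.
    """
    if current_replicas <= 0 or target_metric <= 0:
        raise ValueError("current_replicas and target_metric must be positive")

    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds (minReplicas / maxReplicas).
    return max(min_replicas, min(max_replicas, desired))
```

For example, 3 API pods averaging 90 requests per second against a target of 60 scale to 5 pods. The same rule applies to the embedding service if queue depth per pod is exposed as the metric.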

CI/CD for AI

AI CI/CD adds evaluation gates: run retrieval quality tests, hallucination detection, and latency benchmarks before deploying. A model or chunking change that degrades quality should be blocked automatically.
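
A quality gate of this kind can be sketched as a script that the pipeline runs before the deploy step. The example below checks mean precision@k and p95 latency and reports failures. The 0.8 precision threshold matches the one used in the lab; the 2000 ms latency limit, the function names, and the shape of the test cases are assumptions for illustration. Hallucination detection is left out because it depends on the evaluator you choose.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved document IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def p95(values):
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def evaluate_gates(cases, latencies_ms, k=5,
                   min_precision=0.8, max_p95_ms=2000.0):
    """Return a list of failure messages; an empty list means all gates pass.

    Each case is a dict with 'retrieved' (ranked list of IDs) and
    'relevant' (set of IDs judged relevant for that query).
    """
    scores = [precision_at_k(c["retrieved"], c["relevant"], k) for c in cases]
    mean_precision = sum(scores) / len(scores)

    failures = []
    if mean_precision < min_precision:
        failures.append(
            f"precision@{k} {mean_precision:.2f} is below {min_precision:.2f}")
    latency = p95(latencies_ms)
    if latency > max_p95_ms:
        failures.append(f"p95 latency {latency:.0f} ms exceeds {max_p95_ms:.0f} ms")
    return failures


def gate_exit_code(failures):
    """Map gate results to a process exit code for the CI runner."""
    return 1 if failures else 0
```

In a pipeline, the script would print the failures and call `sys.exit(gate_exit_code(failures))`, so a non-zero exit code stops the deploy job.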

Key Terms

HPA
Horizontal Pod Autoscaler — scales pods based on metrics
StatefulSet
Kubernetes resource for stateful applications with persistent storage
Quality Gate
CI/CD check that blocks deployment if quality metrics degrade

Hands-On Labs

  1. Deploy RAG on Kubernetes

    Deploy the full RAG stack on a Kind cluster.

    40 min - Advanced

    • Build Docker images for RAG API and embedding service
    • Deploy Qdrant StatefulSet and Redis on Kubernetes
    • Deploy RAG API with HPA
    • Test end-to-end with port-forward

    View lab files on GitHub

  2. CI/CD with Quality Gates

    Build a pipeline that blocks deploys on quality regression.

    30 min - Advanced

    • Add retrieval quality tests to CI
    • Run hallucination detection on a test set
    • Set quality thresholds (block if precision < 0.8)
    • Deploy only if all quality gates pass

    View lab files on GitHub

Key Takeaways

  • RAG has mixed workloads: CPU (API), GPU (embeddings), stateful (vector DB)
  • Scale RAG API with HPA on request rate, embedding service on queue depth
  • Qdrant needs persistent storage — use StatefulSet, not Deployment
  • AI CI/CD must include quality gates — not just tests, but evaluation metrics
  • Multi-stage Docker builds keep production images small and secure