Module 13 of 16

Deploying RAG Systems

Dockerizing AI systems, Kubernetes for AI, GPU infrastructure, and CI/CD for AI applications

3.5 hours · 2 labs · Free

Start here

Learning objectives

  • Containerize RAG systems with Docker
  • Deploy on Kubernetes with proper resource management
  • Configure GPU inference for embedding models
  • Build CI/CD pipelines for AI applications

[Figure: RAG deployment architecture. Components: Ingress; RAG API (FastAPI) on CPU pods with HPA; Embedding Svc on GPU pods; Qdrant as a StatefulSet; Redis cache; external LLM API.]

RAG API: CPU pods behind an HPA (scale on requests). Embedding service: GPU pods (scale on queue depth). Qdrant: StatefulSet (persistent). Redis caches embeddings and answers. The LLM API is external (no self-hosting needed).

Deploying RAG is different from deploying a typical web app. You have CPU-bound services (API), GPU-bound services (embeddings), stateful services (vector DB), and external API dependencies (LLM). Each needs different scaling and resource strategies.

Dockerizing RAG

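# Multi-stage Dockerfile for RAG API

# Stage 1: install Python dependencies in a throwaway build stage
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime image with only the installed packages and the app code
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
# Console scripts such as uvicorn are installed to /usr/local/bin, so copy them too
COPY --from=builder /usr/local/bin /usr/local/bin
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]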
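
The CMD above expects a module named main that exposes an ASGI callable named app. As a rough illustration of that contract, here is a dependency-free stand-in, not the FastAPI application the course builds. The /healthz path and the call helper are hypothetical names chosen for this sketch.

```python
import asyncio
import json


async def app(scope, receive, send):
    """Minimal ASGI callable of the kind `uvicorn main:app` can serve."""
    if scope["type"] == "lifespan":
        # Acknowledge startup and shutdown so the server starts and stops cleanly.
        while True:
            message = await receive()
            if message["type"] == "lifespan.startup":
                await send({"type": "lifespan.startup.complete"})
            elif message["type"] == "lifespan.shutdown":
                await send({"type": "lifespan.shutdown.complete"})
                return
    if scope["type"] != "http":
        return

    # /healthz is a hypothetical path that a container health check could poll.
    ok = scope["path"] == "/healthz"
    status = 200 if ok else 404
    body = json.dumps({"status": "ok" if ok else "not found"}).encode()
    await send({
        "type": "http.response.start",
        "status": status,
        "headers": [(b"content-type", b"application/json")],
    })
    await send({"type": "http.response.body", "body": body})


async def call(path):
    """Drive the app in-process, without a server, and return (status, body)."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": "GET", "path": path}, receive, send)
    return sent[0]["status"], json.loads(sent[1]["body"])


print(asyncio.run(call("/healthz")))  # (200, {'status': 'ok'})
```

Saved as main.py next to the Dockerfile, this would be enough for the image to start and answer a health check, which makes it a convenient smoke test before the real API is wired in.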

Kubernetes for AI

RAG API pods scale horizontally with an HPA, on CPU utilization by default or on request rate if you expose it as a custom metric. Embedding service pods may need GPU nodes (or can run on CPU with batching). Qdrant runs as a StatefulSet with persistent volumes. Redis runs as a Deployment for caching.
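
To make the scaling behavior concrete, the sketch below reproduces the replica calculation described in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The function name, the replica bounds, and the sample numbers are assumptions for illustration. The queue-depth example also assumes that queue depth reaches the autoscaler as a per-pod average through a custom or external metrics adapter.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count an HPA would aim for, given one metric and its target."""
    ratio = current_value / target_value
    # Inside the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (minReplicas / maxReplicas).
    return max(min_replicas, min(max_replicas, wanted))


# RAG API: 4 pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90, 60))    # 6

# Embedding service: 2 pods, average queue depth 250 against a target of 100.
print(desired_replicas(2, 250, 100))  # 5

# 63% against a 60% target is within tolerance, so nothing changes.
print(desired_replicas(3, 63, 60))    # 3
```

The same formula covers both services; only the metric differs, which is why the API can scale on CPU or request rate while the embedding service scales on queue depth.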

CI/CD for AI

AI CI/CD adds evaluation gates: run retrieval quality tests, hallucination detection, and latency benchmarks before deploying. A model or chunking change that degrades quality should be blocked automatically.
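
One possible shape for such a gate is a small script that compares measured metrics against thresholds and reports a non-zero exit code on any failure. The sketch below assumes the metrics have already been computed by an earlier pipeline step. The metric names, the hallucination and latency limits, and the sample values are hypothetical; only the 0.8 precision floor comes from the lab in this module.

```python
# Each gate is (direction, limit): "min" means the value must not fall below
# the limit, "max" means it must not exceed it. Limits other than precision
# are placeholders to tune against your own baseline.
THRESHOLDS = {
    "retrieval_precision": ("min", 0.80),
    "hallucination_rate": ("max", 0.05),
    "p95_latency_ms": ("max", 1500.0),
}


def failed_gates(metrics, thresholds=THRESHOLDS):
    """Return a human-readable message for every gate that does not pass."""
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A metric that was never measured counts as a failure.
            failures.append(f"{name}: missing")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} < {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return failures


def main(metrics):
    """Print failures and return a process exit code (0 = deploy may proceed)."""
    failures = failed_gates(metrics)
    for line in failures:
        print("GATE FAILED", line)
    return 1 if failures else 0


# Sample metrics for a candidate build (made-up numbers).
candidate = {
    "retrieval_precision": 0.85,
    "hallucination_rate": 0.02,
    "p95_latency_ms": 900.0,
}
exit_code = main(candidate)
print("exit code:", exit_code)
# In a CI job, pass exit_code to sys.exit so a failed gate stops the pipeline.
```

Treating a missing metric as a failure is a deliberate choice here: a gate that silently passes when the evaluation step did not run would defeat its purpose.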

Key terms

Vocabulary used in this module

HPA

Horizontal Pod Autoscaler — scales pods based on metrics

StatefulSet

Kubernetes resource for stateful applications with persistent storage

Quality Gate

CI/CD check that blocks deployment if quality metrics degrade

Labs

Hands-on labs

40 min · Advanced

Deploy RAG on Kubernetes

Deploy the full RAG stack on a Kind cluster.

  1. Build Docker images for RAG API and embedding service
  2. Deploy Qdrant StatefulSet and Redis on Kubernetes
  3. Deploy RAG API with HPA
  4. Test end-to-end with port-forward
View lab on GitHub
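
For step 3, a CPU-based HPA can be described as an autoscaling/v2 manifest. The sketch below builds one as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The Deployment name rag-api, the replica bounds, and the 70% CPU target are assumptions, not values from the lab. CPU-utilization scaling also presumes that the pods declare CPU requests and that the cluster has a metrics source such as metrics-server, which you may need to install yourself on a local cluster.

```python
import json


def hpa_manifest(name, min_replicas=2, max_replicas=10, cpu_percent=70):
    """Build an autoscaling/v2 HPA that targets a Deployment of the same name."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            # The workload whose replica count the HPA controls.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            # Scale on average CPU utilization relative to the pods' requests.
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": cpu_percent,
                    },
                },
            }],
        },
    }


# "rag-api" is a hypothetical Deployment name.
print(json.dumps(hpa_manifest("rag-api"), indent=2))
```

If the script is saved as, say, hpa.py, its output can be piped straight into the cluster with `python hpa.py | kubectl apply -f -`.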

30 min · Advanced

CI/CD with Quality Gates

Build a pipeline that blocks deploys on quality regression.

  1. Add retrieval quality tests to CI
  2. Run hallucination detection on a test set
  3. Set quality thresholds (block if precision < 0.8)
  4. Deploy only if all quality gates pass
View lab on GitHub
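
The precision figure in step 3 has to be measured somehow. One common choice is precision@k averaged over a labeled test set, sketched below. This is an assumption about the metric, since the lab description does not say which variant of precision it uses. The test cases, document IDs, and k = 5 are made up for illustration.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    # Dividing by k (not by the number returned) penalizes short result lists.
    return hits / k


def mean_precision(cases, k):
    """Average precision@k over (retrieved_ids, relevant_ids) pairs."""
    scores = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in cases]
    return sum(scores) / len(scores)


# Hypothetical test set: ranked retriever output paired with labeled relevant IDs.
cases = [
    (["d1", "d7", "d3", "d9", "d4"], {"d1", "d3", "d4", "d9"}),  # 4 of 5 relevant
    (["d2", "d5", "d8", "d6", "d0"], {"d0", "d2", "d5", "d6", "d8"}),  # 5 of 5
]

score = mean_precision(cases, k=5)
THRESHOLD = 0.8  # from the lab: block if precision < 0.8
print("mean precision@5:", round(score, 3))
print("gate:", "pass" if score >= THRESHOLD else "block")
```

In a real pipeline the retrieved IDs would come from running the candidate retriever against the test queries, with the relevant IDs kept in version control next to the tests.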

Recap

Key takeaways

  • RAG has mixed workloads: CPU (API), GPU (embeddings), stateful (vector DB)
  • Scale RAG API with HPA on request rate, embedding service on queue depth
  • Qdrant needs persistent storage — use StatefulSet, not Deployment
  • AI CI/CD must include quality gates — not just tests, but evaluation metrics
  • Multi-stage Docker builds keep production images small and secure
