Module 13 of 16

Deploying RAG Systems

Dockerizing AI systems, Kubernetes for AI, GPU infrastructure, and CI/CD for AI applications

3.5 hours · 2 labs · Free


Start here

Learning objectives

Containerize RAG systems with Docker
Deploy on Kubernetes with proper resource management
Configure GPU inference for embedding models
Build CI/CD pipelines for AI applications

Deploying RAG is different from deploying a typical web app. You have CPU-bound services (API), GPU-bound services (embeddings), stateful services (vector DB), and external API dependencies (LLM). Each needs different scaling and resource strategies.

Dockerizing RAG

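# Multi-stage Dockerfile for RAG API
# Stage 1: install Python dependencies
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime image with only the installed packages and the app code
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
# Console scripts such as uvicorn are installed to /usr/local/bin, so copy them too;
# without this the CMD below fails because the uvicorn executable is missing
COPY --from=builder /usr/local/bin /usr/local/bin
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]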

Kubernetes for AI

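RAG API pods scale horizontally with a Horizontal Pod Autoscaler (HPA), typically on CPU utilization to start, or on request rate if custom metrics are exposed. Embedding service pods may need GPU nodes, or can run on CPU with request batching. Qdrant runs as a StatefulSet with persistent volumes so its data survives pod restarts. Redis runs as a Deployment for caching.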
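
To make the HPA behavior concrete, here is a minimal sketch of the scaling rule described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0. The function name, the replica bounds, and the example numbers are assumptions for illustration; the real controller also handles pod readiness, missing metrics, and stabilization windows, which this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count suggested by the basic HPA scaling rule (simplified)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Close enough to the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    proposed = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical defaults here).
    return max(min_replicas, min(max_replicas, proposed))


# 3 API pods averaging 90% CPU against a 60% target
print(desired_replicas(3, 90, 60))  # 5
# 62% against a 60% target is within tolerance, so nothing changes
print(desired_replicas(3, 62, 60))  # 3
```

The same arithmetic applies whether the metric is CPU utilization, requests per second, or queue depth per pod; only the metric source changes.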

CI/CD for AI

AI CI/CD adds evaluation gates: run retrieval quality tests, hallucination detection, and latency benchmarks before deploying. A model or chunking change that degrades quality should be blocked automatically.
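
An evaluation gate of this kind can be sketched as a small script that computes metrics on a fixed test set and reports which thresholds were violated. The sketch below is illustrative only: the function names, the test-case layout, and the metric names are hypothetical, and apart from the 0.8 precision threshold used in the lab, the threshold values are assumptions rather than recommendations.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(cases, k):
    """Average precision@k over test cases with 'retrieved' and 'relevant' keys."""
    scores = [precision_at_k(c["retrieved"], set(c["relevant"]), k) for c in cases]
    return sum(scores) / len(scores)


def check_gates(metrics, gates):
    """Return failure messages; an empty list means every gate passed."""
    failures = []
    for name, (kind, threshold) in gates.items():
        if name not in metrics:
            # A metric that was never measured should block the deploy too.
            failures.append(f"{name} is missing")
            continue
        value = metrics[name]
        ok = value >= threshold if kind == "min" else value <= threshold
        if not ok:
            failures.append(f"{name}={value:.3f} violates {kind} {threshold}")
    return failures


# Hypothetical gates: "min" means higher is better, "max" means lower is better.
GATES = {
    "precision_at_3": ("min", 0.8),       # threshold from the lab
    "hallucination_rate": ("max", 0.05),  # assumed value
    "p95_latency_s": ("max", 1.5),        # assumed value
}

cases = [
    {"retrieved": ["d1", "d2", "d3"], "relevant": ["d1", "d2", "d3"]},
    {"retrieved": ["d4", "d9", "d5"], "relevant": ["d4", "d5"]},
]
metrics = {
    "precision_at_3": mean_precision_at_k(cases, k=3),
    "hallucination_rate": 0.02,  # would come from a separate detection step
    "p95_latency_s": 1.1,        # would come from a latency benchmark
}
failures = check_gates(metrics, GATES)
print("PASS" if not failures else "\n".join(failures))
```

In a pipeline, the step would exit with a non-zero status whenever `failures` is non-empty, which is what causes the CI system to block the deploy.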

Key terms

Vocabulary used in this module

HPA

Horizontal Pod Autoscaler - adjusts the number of pod replicas based on observed metrics such as CPU utilization

StatefulSet

Kubernetes workload resource for stateful applications - gives each pod a stable identity and its own persistent storage

Quality Gate

CI/CD check that blocks deployment if quality metrics degrade

Labs

Hands-on labs

40 min · Advanced

Deploy RAG on Kubernetes

Deploy the full RAG stack on a Kind cluster.

Build Docker images for RAG API and embedding service
Deploy Qdrant StatefulSet and Redis on Kubernetes
Deploy RAG API with HPA
Test end-to-end with port-forward


30 min · Advanced

CI/CD with Quality Gates

Build a pipeline that blocks deploys on quality regression.

Add retrieval quality tests to CI
Run hallucination detection on a test set
Set quality thresholds (block if precision < 0.8)
Deploy only if all quality gates pass


Recap

Key takeaways

RAG has mixed workloads: CPU (API), GPU (embeddings), stateful (vector DB)
Scale RAG API with HPA on request rate, embedding service on queue depth
Qdrant needs persistent storage - use StatefulSet, not Deployment
AI CI/CD must include quality gates - not just tests, but evaluation metrics
Multi-stage Docker builds keep production images small and secure
