Deploying RAG is different from deploying a typical web app. You have CPU-bound services (API), GPU-bound services (embeddings), stateful services (vector DB), and external API dependencies (LLM). Each needs different scaling and resource strategies.
Dockerizing RAG
# Multi-stage Dockerfile for RAG API
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Kubernetes for AI
RAG API pods scale horizontally with HPA (CPU-based). Embedding service pods may need GPU nodes (or use CPU with batching). Qdrant runs as a StatefulSet with persistent volumes. Redis runs as a deployment for caching.
CI/CD for AI
AI CI/CD adds evaluation gates: run retrieval quality tests, hallucination detection, and latency benchmarks before deploying. A model or chunking change that degrades quality should be blocked automatically.