Module 6: Scalability Engineering
Horizontal scaling, autoscaling, caching, CDNs, rate limiting — how production systems handle 10x and 100x traffic without 10x and 100x cost.
4 hours. 3 hands-on labs. Free course module.
Learning Objectives
- Design stateless services that scale horizontally without coordination
- Pick the right caching strategy (cache-aside, write-through, write-back) for the workload
- Configure Kubernetes HPA, VPA, and Cluster Autoscaler so they actually work
- Implement distributed rate limiting that survives multi-region
- Identify the scalability bottleneck before it becomes the outage
Why This Matters
Scalability engineering separates the engineers who can ship a system that works at 1k RPS from the ones who can ship a system that works at 1M RPS. Most architectures hit a single bottleneck early; the engineering skill is identifying that bottleneck before it bites and moving it before users feel it. Once you internalise the “every system has a bottleneck” mindset, you stop being surprised when the database connection pool exhausts under load you thought was easy.
Lesson Content
Scalability is not adding more machines. Scalability is removing the contention points that prevent more machines from helping. Every system has a bottleneck; the question is whether the next 10x of load hits a bottleneck you have already moved or one that is still in the way.
Horizontal vs Vertical Scaling
Vertical scaling (bigger machines) hits hard limits and risks single points of failure. Horizontal scaling (more machines) is the path to real scale, but only works if your service is stateless or partitions correctly.
Stateless services are the foundation. Stateless means any replica can serve any request. If you can swap one pod for another at any time without migrating state, throughput grows roughly in proportion to replica count until a shared dependency (database, cache, downstream API) saturates. Common state-leaking patterns to avoid:
- Local file caches that differ across replicas (move to Redis or a shared filesystem).
- Sticky sessions on the load balancer (use a session store like Redis instead).
- In-process queues that hold work (move to Kafka/SQS).
- Per-replica scheduled jobs (use a leader-elected singleton or distributed cron).
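As a rough illustration of the sticky-session item above, the sketch below keeps session data behind a small store interface so that either replica can serve the same session. `SessionStore`, `InMemoryStore`, and `Replica` are hypothetical names, and the in-memory store only stands in for a shared store such as Redis.

```python
from typing import Protocol


class SessionStore(Protocol):
    """Hypothetical interface; in production this would wrap a shared store such as Redis."""

    def get(self, session_id: str) -> dict: ...
    def put(self, session_id: str, data: dict) -> None: ...


class InMemoryStore:
    """Stand-in for the shared store so the example runs without external services."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, session_id: str) -> dict:
        return dict(self._data.get(session_id, {}))

    def put(self, session_id: str, data: dict) -> None:
        self._data[session_id] = dict(data)


class Replica:
    """A replica keeps no per-user state of its own; everything lives in the store."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def handle_request(self, session_id: str) -> int:
        session = self.store.get(session_id)
        session["visits"] = session.get("visits", 0) + 1
        self.store.put(session_id, session)
        return session["visits"]


shared = InMemoryStore()
a, b = Replica("a", shared), Replica("b", shared)
first = a.handle_request("s1")   # served by replica a
second = b.handle_request("s1")  # served by replica b, which sees a's write
```

The read-modify-write here is not atomic; a real shared store would need an atomic increment or a transaction for counters like this.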
Caching as a Scaling Lever
Caching multiplies effective capacity. The Caching Strategies guide covers this in depth. The summary:
- Cache-aside: app checks cache, falls back to DB, populates cache. The default for read-heavy workloads.
- Write-through: writes go to cache + DB synchronously. The cache is always fresh; writes are slower.
- Write-back: writes go to cache; cache flushes to DB asynchronously. Fast writes; data loss window.
- Read-through: cache itself loads from DB on miss. Simpler app code; coupled cache and DB.
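A minimal sketch of the first two strategies, assuming a plain dict as the database and a small in-process TTL cache as a stand-in for Redis. `TTLCache`, `read_cache_aside`, and `write_through` are illustrative names, not a library API.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny in-process cache with per-entry expiry; a stand-in for Redis or similar."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            return None  # missing or expired
        return entry[1]

    def set(self, key: str, value: str, ttl: float) -> None:
        self._entries[key] = (self._clock() + ttl, value)


def read_cache_aside(cache: TTLCache, db: dict, key: str, ttl: float = 60.0) -> Optional[str]:
    """Cache-aside read: check the cache, fall back to the DB, then populate the cache."""
    value = cache.get(key)
    if value is not None:
        return value
    value = db.get(key)
    if value is not None:
        cache.set(key, value, ttl)
    return value


def write_through(cache: TTLCache, db: dict, key: str, value: str, ttl: float = 60.0) -> None:
    """Write-through: update the DB and the cache in the same synchronous call."""
    db[key] = value
    cache.set(key, value, ttl)
```

With cache-aside alone, a write that bypasses the cache stays invisible to readers until the TTL expires; write-through closes that gap at the cost of a slower write path.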
Multi-layer caching is the production reality: browser → CDN → L7 cache → in-process → Redis. Each layer has different invalidation cost and different blast radius.
CDN — Caching at the Edge
CDNs (Cloudflare, Fastly, Akamai, CloudFront) cache content at hundreds of edge POPs close to users. The contract between origin and edge is the Cache-Control header. For example, "Cache-Control: public, max-age=3600, stale-while-revalidate=86400" tells the CDN to serve from cache for an hour, then for up to a further day to serve the stale copy while it revalidates in the background.
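To make the header semantics concrete, here is a simplified sketch of how an edge might classify a cached response by its age. It only looks at max-age and stale-while-revalidate; real CDNs also honour directives such as s-maxage, private, and no-store, which this deliberately ignores. `parse_cache_control` and `edge_decision` are hypothetical helper names.

```python
from typing import Optional


def parse_cache_control(header: str) -> dict:
    """Parse a Cache-Control value into {directive: int or None}."""
    directives: dict = {}
    for part in header.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name] = int(value) if value.isdigit() else None
    return directives


def edge_decision(header: str, age_seconds: int) -> str:
    """Classify a cached object as 'fresh', 'stale-revalidate', or 'fetch'."""
    directives = parse_cache_control(header)
    max_age: Optional[int] = directives.get("max-age") or 0
    swr: Optional[int] = directives.get("stale-while-revalidate") or 0
    if age_seconds < max_age:
        return "fresh"  # serve from cache, no origin contact
    if age_seconds < max_age + swr:
        return "stale-revalidate"  # serve stale now, refresh in the background
    return "fetch"  # too old: go to the origin before answering


HEADER = "public, max-age=3600, stale-while-revalidate=86400"
decisions = [edge_decision(HEADER, age) for age in (10, 3600, 90000)]
```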
Modern CDNs are also where you put: WAF, edge auth, geo routing, A/B test branching, and increasingly compute (Cloudflare Workers, Lambda@Edge). The edge is where the cheapest scaling lives.
Autoscaling on Kubernetes
Three layers of autoscaling, each independent:
- HPA (Horizontal Pod Autoscaler): scale pod replicas based on CPU, memory, or custom metrics (RPS, latency, queue depth). Default scale-up is fast, scale-down conservative to avoid flapping.
- VPA (Vertical Pod Autoscaler): rightsize resource requests over time. Useful for batch and unpredictable workloads; do not run it against the same resource metric (CPU or memory) that HPA scales on, or the two will fight each other.
- Cluster Autoscaler / Karpenter: add nodes when pods cannot schedule due to resource shortage; remove underutilised nodes. Karpenter, which originated at AWS, is the newer alternative and is typically faster and more flexible than Cluster Autoscaler.
The classic mistake: HPA on CPU when the bottleneck is connection pool, database, or downstream RPC. Always scale on the metric closest to user latency — often p99 latency or RPS, not CPU.
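The scaling rule documented for HPA is a simple ratio: desired replicas = ceil(current replicas x current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). The sketch below reproduces that arithmetic to show why the choice of metric matters; `desired_replicas` is an illustrative function, and the real controller adds stabilization windows and readiness handling that are left out here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     tolerance: float = 0.1, min_replicas: int = 1,
                     max_replicas: int = 100) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Same service, same moment, two different signals (hypothetical numbers):
on_cpu = desired_replicas(10, current_metric=30, target_metric=60)    # CPU at 30%, target 60%
on_rps = desired_replicas(10, current_metric=150, target_metric=100)  # 150 RPS per pod, target 100
```

On CPU the autoscaler wants to shrink the deployment while users are already waiting; on per-pod RPS it scales out.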
Distributed Rate Limiting
The Rate Limiting Algorithms guide covers token bucket, sliding window, distributed Redis-Lua patterns, and adaptive rate limiting. Three production rules:
- Layer rate limits: CDN volumetric, gateway per-API-key, application per-user-action.
- Choose fail-open or fail-closed deliberately when the rate-limit service is unavailable.
- Authentication endpoints get stricter limits than read endpoints.
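The fail-open versus fail-closed rule above can be expressed as an explicit per-endpoint policy rather than an accident of exception handling. This is a sketch under assumptions: `LimiterUnavailable`, `check_request`, and the `POLICY` table are hypothetical, and the limiter is any callable that returns True when a request is allowed.

```python
from typing import Callable


class LimiterUnavailable(Exception):
    """Raised by the (hypothetical) limiter backend when it cannot be reached."""


def check_request(limiter: Callable[[str], bool], key: str, fail_open: bool) -> bool:
    """Return True if the request may proceed."""
    try:
        return limiter(key)
    except LimiterUnavailable:
        # Deliberate choice: favour availability (fail open) or protection (fail closed).
        return fail_open


# Hypothetical policy: reads fail open, authentication fails closed.
POLICY = {"GET /items": True, "POST /login": False}


def broken_limiter(key: str) -> bool:
    raise LimiterUnavailable("limiter backend timed out")


def healthy_limiter(key: str) -> bool:
    return key != "abusive-client"


read_allowed = check_request(broken_limiter, "client-1", POLICY["GET /items"])
login_allowed = check_request(broken_limiter, "client-1", POLICY["POST /login"])
```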
Identifying the Bottleneck
Every system has a current bottleneck. The skill is identifying it before the user does. Common bottlenecks, roughly in order of how often they show up:
- Database connection pool exhaustion (because pool size < concurrent demand).
- Single-shard hot key in Redis or Cassandra.
- Synchronous external API call with no caching.
- Disk I/O on a single node (Raft fsync, database writes).
- Single-threaded code path in an otherwise concurrent service.
The diagnostic flow: load test until something gives. Where is CPU? Where is memory? Where is the queue depth growing? Where is latency climbing first? The answer points at the bottleneck.
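For the connection-pool case, a back-of-envelope check with Little's law (average concurrency = arrival rate x time each request holds a connection) often tells you the answer before the load test does. The function names and numbers below are illustrative assumptions, and the result is an average, so real pools need headroom for bursts.

```python
import math


def connections_needed(requests_per_second: float, hold_ms: float) -> int:
    """Average number of connections in use, by Little's law."""
    return math.ceil(requests_per_second * hold_ms / 1000)


def pool_is_bottleneck(requests_per_second: float, hold_ms: float, pool_size: int) -> bool:
    """True when average demand alone already exceeds the pool."""
    return connections_needed(requests_per_second, hold_ms) > pool_size


# Hypothetical service: each request holds a DB connection for 20 ms, pool of 30.
today = connections_needed(500, 20)       # demand at current traffic
after_4x = connections_needed(2000, 20)   # demand after a 4x traffic increase
```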
[Figure: Cache Hierarchy in Practice]
[Figure: Distributed Rate Limiter Architecture]
Self-Check Quiz
- HPA scales on CPU. Your CPU is at 30%. Your service is throttled. What gives? (Answer: HPA is scaling on the wrong metric. Real bottleneck is probably connection pool, downstream RPC, or DB. Scale on the metric closest to user latency — RPS or p99 latency.)
- Cache hit rate dropped from 95% to 60% overnight. What three things do you check? (Answer: recent deploy that changed key shape; eviction rate spike from memory pressure; downstream errors causing skipped writes.)
- You add a CDN to a site already using Redis caching. Where do invalidations get hardest? (Answer: between layers. CDN may serve stale even after Redis is invalidated. Use surrogate keys or short TTLs at the CDN.)
- Karpenter aggressively scales nodes down at night. The next morning, traffic spikes and pods take 3 minutes to schedule. What do you change? (Answer: warm pool / over-provisioning, or scale-down deferral. Karpenter is fast at scale-up but cold-start latency on a fresh node still bites.)
For deeper caching patterns including invalidation flows and multi-region cache architecture, read the Caching Strategies guide. For rate-limiter implementation specifics see the Rate Limiting Algorithms guide. The Kubernetes cheatsheet covers HPA/VPA/Karpenter operational patterns.
Real-World Use Cases
- AWS DynamoDB's burst capacity is a literal token-bucket implementation visible to users.
- Cloudflare absorbs trillions of requests per day at the edge with a layered cache that serves most reads before any origin is involved.
- Stripe enforces per-API-key rate limits with token-bucket counters in Redis Lua scripts.
- Netflix publishes an open-source adaptive concurrency limits library (Netflix/concurrency-limits) that adjusts a service's concurrency limit dynamically from observed latency.
Production Notes
- Profile workloads BEFORE setting resource requests. Many workloads request 2-3x what they actually use; right-sizing is direct cost savings.
- Scale on the metric closest to user latency, not CPU. CPU at 30% with throttled latency means CPU is not your bottleneck.
- For Karpenter on AWS, enable aggressive node consolidation, but pair it with PodDisruptionBudgets so that consolidation does not cause outages.
Common Mistakes
- Setting HPA on CPU when the database connection pool is the actual bottleneck.
- Caching everything by default; sometimes the database is fast enough and the cache is just extra failure surface.
- Cluster Autoscaler with no Pod Disruption Budgets; nodes scale down and take working pods with them.
Security Risks to Watch
- Cache poisoning via unkeyed request headers (such as X-Forwarded-Host, or Host and Vary mishandling) lets one attacker affect many users.
- Multi-tenant cache without tenant_id in the key leaks data between customers; a known SOC2 incident class.
- Rate-limit bypass via X-Forwarded-For spoofing when the origin trusts the wrong header.
- CDN-cached responses that should never be cached (auth-bearing, per-user) are a recurring breach pattern (web-cache deception).
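For the X-Forwarded-For item above, one common defence is to trust the header only when the direct peer is a known proxy, and then to walk the list from the right until the first address that is not one of your own proxies. The sketch below assumes a hypothetical trusted range of 10.0.0.0/8; `TRUSTED_PROXIES`, `is_trusted`, and `client_ip` are illustrative names.

```python
import ipaddress
from typing import Optional

# Assumption for this example: all of our own proxies live in this range.
TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/8")]


def is_trusted(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    return any(addr in net for net in TRUSTED_PROXIES)


def client_ip(peer_ip: str, x_forwarded_for: Optional[str]) -> str:
    """Pick the address to rate-limit on."""
    if not x_forwarded_for or not is_trusted(peer_ip):
        return peer_ip  # header absent, or sent by someone we do not trust
    hops = [h.strip() for h in x_forwarded_for.split(",") if h.strip()]
    # Rightmost entries were appended by our proxies; leftmost ones are client-supplied.
    for hop in reversed(hops):
        try:
            if not is_trusted(hop):
                return hop
        except ValueError:
            return peer_ip  # malformed entry: fall back to the direct peer
    return peer_ip


spoof_attempt = client_ip("10.0.0.5", "1.2.3.4, 198.51.100.7")
```

Keying the limiter on the leftmost entry instead would let any client pick its own bucket by sending a forged header.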
Design Tradeoffs
HPA on CPU
Pros
- Simple, default
- Works well for CPU-bound workloads
Cons
- Wrong signal for I/O-bound workloads
- Lag between CPU spike and request latency
HPA on RPS / queue depth (custom metrics)
Pros
- Scales on the actual load signal
- Faster reaction
Cons
- Requires a custom-metrics adapter (for example Prometheus Adapter)
- More tuning
KEDA event-driven scaling
Pros
- Scales on Kafka lag, queue depth, etc.
- Scale to zero when idle
Cons
- Extra component to operate
- Cold-start tax on scale-up
Production Alternatives
- Kubernetes HPA + Karpenter: Cloud-native autoscaling stack; a common default on AWS.
- KEDA event-driven autoscaling: Scale on Kafka lag, queue depth, custom metrics; can scale to zero.
- Cluster Autoscaler: Pre-Karpenter node autoscaler; still useful on GCP/Azure.
- Redis Cluster (called "cluster mode enabled" on managed services such as ElastiCache): Standard distributed cache; keys are sharded across 16384 hash slots.
- CDN-first architecture (Cloudflare Workers, Lambda@Edge): Move logic to the edge; lowest latency, lowest cost at scale.
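On the Redis Cluster entry above: the slot for a key is CRC16 of the key modulo 16384, and a non-empty {hash tag} inside the key restricts hashing to the tag. That is also why a single hot key cannot be spread across shards. The sketch below follows that rule using Python's `binascii.crc_hqx` with an initial value of 0, which is the XMODEM CRC16 variant; `key_hash_slot` is an illustrative helper, not a client-library function.

```python
import binascii


def key_hash_slot(key: str) -> int:
    """Compute the cluster hash slot for a key, honouring {hash tags}."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" means the whole key is hashed.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


# Keys sharing a hash tag land in the same slot, so multi-key operations stay on one shard.
profile_slot = key_hash_slot("{user:42}:profile")
orders_slot = key_hash_slot("{user:42}:orders")
```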
Think Like an Engineer
- Identify your bottleneck before scaling. Throwing replicas at a connection-pool problem makes it worse, not better.
- For every cache, define the freshness contract upfront. “5 minutes stale is fine” vs “must reflect the latest write” drives the entire invalidation strategy.
- Capacity planning is not a one-time exercise. Workload behaviour drifts; review monthly.
Production Story
A consumer team kept scaling their service horizontally as traffic grew. At 50 replicas, p99 latency suddenly spiked to 5 seconds and would not come down. Investigation showed the bottleneck was a 100-connection PostgreSQL pool shared across all 50 replicas: every added replica competed for the same connections. The fix was a connection pooler (PgBouncer) plus per-replica pool limits sized so that their sum stayed under the global ceiling. Latency returned to 30ms p99. The lesson: scaling pushes the bottleneck downstream, so identify the bottleneck before scaling.
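The sizing arithmetic behind that fix is small enough to sketch. The replica count and ceiling come from the story; the `reserved` parameter (connections held back for migrations or admin sessions) and the function name are assumptions added for illustration.

```python
def per_replica_pool_size(global_ceiling: int, replicas: int, reserved: int = 0) -> int:
    """Largest per-replica pool such that all replicas together stay under the ceiling."""
    usable = global_ceiling - reserved
    if replicas <= 0 or usable < replicas:
        raise ValueError("not enough connections for at least one per replica")
    return usable // replicas


story_limit = per_replica_pool_size(global_ceiling=100, replicas=50)
with_reserve = per_replica_pool_size(global_ceiling=100, replicas=50, reserved=10)
```

Recomputing this number whenever the autoscaler's maximum replica count changes keeps the sum of the pools from drifting past the database limit.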
Key Terms
- Stateless service: A service where any replica can serve any request without local state; foundation of horizontal scaling.
- HPA: Kubernetes Horizontal Pod Autoscaler; scales replicas based on metrics.
- Cluster Autoscaler / Karpenter: Kubernetes node autoscalers; add/remove nodes based on pending pods.
- Thundering herd: Failure mode when many concurrent requests miss the cache and overwhelm the origin.
- Cache stampede: Same as thundering herd; many concurrent recomputes of the same expired cache key.
Hands-On Labs
Lab 6.1 — HPA on Custom Metrics
Configure HPA based on RPS or queue depth via Prometheus Adapter; observe scale-up under load.
90 minutes - Intermediate
- Deploy app + Prometheus + Prometheus Adapter
- Define HPA on RPS metric
- Generate load; watch replicas scale up
- Cool down; watch scale down
Lab 6.2 — Cache-Aside with Stampede Protection
Implement cache-aside with per-key locking to prevent thundering herd.
60 minutes - Intermediate
- Implement naive cache-aside
- Reproduce stampede on cache expiry
- Add per-key Redis lock for recompute
- Verify single recompute under load
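A possible starting point for this lab, sketched in a single process: `threading.Lock` objects stand in for the per-key Redis lock, and `SingleFlightCache` is a hypothetical name. The shape is the same as the distributed version: take the key's lock, re-check the cache, and recompute only if it is still empty. Expiry is omitted to keep the example short.

```python
import threading
import time
from typing import Callable


class SingleFlightCache:
    """Cache where at most one caller recomputes a missing key at a time."""

    def __init__(self) -> None:
        self._values: dict = {}
        self._locks: dict = {}
        self._locks_guard = threading.Lock()

    def _lock_for(self, key: str) -> threading.Lock:
        with self._locks_guard:  # protect creation of the per-key lock itself
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key: str, recompute: Callable[[], str]) -> str:
        value = self._values.get(key)
        if value is not None:
            return value
        with self._lock_for(key):
            # Re-check: another thread may have filled the key while we waited.
            value = self._values.get(key)
            if value is None:
                value = recompute()
                self._values[key] = value
            return value


calls = []


def slow_recompute() -> str:
    calls.append(1)   # count how many times the origin is hit
    time.sleep(0.05)  # simulate an expensive query
    return "value"


cache = SingleFlightCache()
threads = [threading.Thread(target=cache.get, args=("k", slow_recompute)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With a Redis lock the same pattern needs a lock timeout, so that a crashed holder does not block every other caller.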
Lab 6.3 — Distributed Rate Limiter (Redis Lua)
Implement an atomic token-bucket rate limiter as a Redis Lua script; load test it.
60 minutes - Advanced
- Write Lua script for atomic token bucket update
- Hit it from many concurrent clients
- Verify the rate is enforced globally
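Before writing the Lua script, it can help to pin down the update it has to perform as one indivisible step: refill by elapsed time, cap at capacity, then try to spend. The sketch below is that logic in plain Python with an explicit clock argument. It is not atomic across processes, which is exactly the property the Redis Lua script is meant to add. `Bucket` and `take` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float      # tokens currently available
    updated_at: float  # time of the last refill, in seconds


def take(bucket: Bucket, now: float, rate: float, capacity: float, cost: float = 1.0) -> bool:
    """Refill by elapsed time, then try to spend `cost` tokens. Mutates `bucket`."""
    now = max(now, bucket.updated_at)  # ignore clocks that appear to run backwards
    elapsed = now - bucket.updated_at
    bucket.tokens = min(capacity, bucket.tokens + elapsed * rate)
    bucket.updated_at = now
    if bucket.tokens >= cost:
        bucket.tokens -= cost
        return True
    return False


# 1 token per second, burst of 2 (hypothetical limits).
bucket = Bucket(tokens=2.0, updated_at=0.0)
burst = [take(bucket, now=0.0, rate=1.0, capacity=2.0) for _ in range(3)]
after_one_second = take(bucket, now=1.0, rate=1.0, capacity=2.0)
```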
Key Takeaways
- Stateless services are the foundation of horizontal scale — remove state-leaking patterns first
- Multi-layer caching multiplies capacity; pick a strategy per layer deliberately
- Scale on the metric closest to user latency, not on CPU when CPU is not the bottleneck
- Distributed rate limiting requires consensus or aggregation — pick the trade-off
- Every system has a bottleneck; the engineering work is moving it before it bites