-Design stateless services that scale horizontally without coordination
-Pick the right caching strategy (cache-aside, write-through, write-back) for the workload
-Configure Kubernetes HPA, VPA, and Cluster Autoscaler so they actually work
-Implement distributed rate limiting that survives multi-region
-Identify the scalability bottleneck before it becomes the outage
Before
-Vertical scaling until the box maxes out; no headroom for growth
-In-memory state on every replica; no horizontal scaling possible
-Caches added reactively after the first outage; no invalidation strategy
-HPA on CPU when the bottleneck is the connection pool; scaling helps until it doesn't
After
+Stateless services with shared distributed cache + database; linear horizontal scale
+Multi-layer cache hierarchy (CDN, in-process, distributed); each layer absorbs different load
+Cache invalidation via CDC events; cache and database stay coherent
+HPA on the metric closest to user latency; scaling responds to actual demand
Scalability is not adding more machines. Scalability is removing the contention points that prevent more machines from helping. Every system has a bottleneck; the question is whether the next 10x of load hits a bottleneck you have already moved or one that is still in the way.
Horizontal vs Vertical Scaling
Vertical scaling (bigger machines) hits hard limits and risks single points of failure. Horizontal scaling (more machines) is the path to real scale, but only works if your service is stateless or partitions correctly.
Stateless services are the foundation. Stateless means: any replica can serve any request. If you can swap one pod for another at any time without state migration, you can scale linearly. Common state-leaking patterns to avoid:
Local file caches that differ across replicas (move to Redis or a shared filesystem).
Sticky sessions on the load balancer (use a session store like Redis instead).
In-process queues that hold work (move to Kafka/SQS).
Per-replica scheduled jobs (use a leader-elected singleton or distributed cron).
Caching as a Scaling Lever
Caching multiplies effective capacity. The Caching Strategies guide covers this in depth. The summary:
Cache-aside: app checks cache, falls back to DB, populates cache. The default for read-heavy workloads.
Write-through: writes go to cache + DB synchronously. The cache is always fresh; writes are slower.
Write-back: writes go to cache; cache flushes to DB asynchronously. Fast writes, but anything not yet flushed is lost if the cache fails.
Read-through: cache itself loads from DB on miss. Simpler app code; coupled cache and DB.
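A minimal sketch of the cache-aside pattern from the list above, assuming a small in-memory `TTLCache` as a stand-in for a shared cache such as Redis and a plain dict as the database. The names `TTLCache`, `read_user`, and `write_user` are hypothetical and exist only for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for a shared cache (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries behave like misses.
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def read_user(cache: TTLCache, db: Dict[str, str], user_id: str,
              ttl_seconds: float = 300.0) -> str:
    """Cache-aside read: check the cache, fall back to the DB, populate the cache."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = db[user_id]  # hypothetical DB lookup
    cache.set(key, value, ttl_seconds)
    return value


def write_user(cache: TTLCache, db: Dict[str, str], user_id: str, value: str) -> None:
    """Cache-aside write: update the DB, then invalidate the cached copy."""
    db[user_id] = value
    cache.delete(f"user:{user_id}")
```

The TTL is the freshness contract in code form: a write that bypasses `write_user` stays invisible to readers until the entry expires.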
Multi-layer caching is the production reality: browser → CDN → L7 cache → in-process → Redis. Each layer has different invalidation cost and different blast radius.
CDN — Caching at the Edge
CDNs (Cloudflare, Fastly, Akamai, CloudFront) cache content at hundreds of edge POPs close to users. The contract between origin and edge is the Cache-Control header. For example, the value public, max-age=3600, stale-while-revalidate=86400 tells the CDN: serve from cache for an hour, then serve stale for up to a day while refreshing in the background.
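How a cache might interpret that header can be sketched as below. This is a simplified model, not any CDN's actual logic: it reads only `max-age` and `stale-while-revalidate` and ignores directives such as `s-maxage`, `no-store`, and `private`. The function names are hypothetical.

```python
from typing import Dict


def parse_cache_control(header: str) -> Dict[str, int]:
    """Extract integer-valued directives (e.g. max-age) from a Cache-Control value."""
    directives: Dict[str, int] = {}
    for part in header.split(","):
        name, sep, value = part.strip().partition("=")
        # Flags without a value (such as "public") are skipped in this sketch.
        if sep and value.strip().isdigit():
            directives[name.lower()] = int(value)
    return directives


def cache_state(header: str, age_seconds: int) -> str:
    """Classify a cached response of the given age under the header's policy."""
    directives = parse_cache_control(header)
    max_age = directives.get("max-age", 0)
    stale_window = directives.get("stale-while-revalidate", 0)
    if age_seconds < max_age:
        return "fresh"  # serve from cache, no origin contact
    if age_seconds < max_age + stale_window:
        return "stale-while-revalidate"  # serve stale, refresh in the background
    return "expired"  # must go back to the origin
```

With the header above, a response aged 30 minutes is classified as fresh, and one aged two hours falls in the stale-while-revalidate window.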
Modern CDNs are also where you put: WAF, edge auth, geo routing, A/B test branching, and increasingly compute (Cloudflare Workers, Lambda@Edge). The edge is where the cheapest scaling lives.
Autoscaling on Kubernetes
Three layers of autoscaling, each independent:
HPA (Horizontal Pod Autoscaler): scale pod replicas based on CPU, memory, or custom metrics (RPS, latency, queue depth). Default scale-up is fast, scale-down conservative to avoid flapping.
VPA (Vertical Pod Autoscaler): rightsize resource requests over time. Useful for batch and unpredictable workloads; clashes with HPA on the same metrics.
Cluster Autoscaler / Karpenter: add nodes when pods cannot schedule due to resource shortage; remove underutilised nodes. Karpenter is the modern AWS-native replacement, faster and more flexible than Cluster Autoscaler.
The classic mistake: HPA on CPU when the bottleneck is connection pool, database, or downstream RPC. Always scale on the metric closest to user latency — often p99 latency or RPS, not CPU.
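Why the choice of metric matters becomes clearer from the HPA scaling rule itself. The sketch below follows the formula in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with the default 10% tolerance. It is a simplification: the real controller also handles unready pods, missing metrics, and stabilization windows. The function name and the min/max defaults are assumptions.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within tolerance of the target, the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas at 30% CPU against a 60% target, the rule asks for 2 replicas, so a CPU-based HPA scales down even while the service is starved on connections. The same 4 replicas at 250 RPS per pod against a target of 100 would scale to 10.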
Distributed Rate Limiting
The Rate Limiting Algorithms guide covers token bucket, sliding window, distributed Redis-Lua patterns, and adaptive rate limiting. Two of the production rules:
Choose fail-open or fail-closed deliberately when the rate-limit service is unavailable.
Authentication endpoints get stricter limits than read endpoints.
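The rules above can be sketched with a single-process token bucket plus an explicit failure policy. This is an illustration under stated assumptions, not a distributed limiter: a real deployment would keep the bucket state in a shared store. `TokenBucket`, `check_limit`, and `unavailable_limiter` are hypothetical names, and `ConnectionError` stands in for whatever error the limiter client actually raises.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at a steady rate."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never beyond capacity.
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def check_limit(limiter_call: Callable[[], bool], fail_open: bool) -> bool:
    """Apply the chosen policy when the limiter backend cannot be reached."""
    try:
        return limiter_call()
    except ConnectionError:
        # fail_open=True admits the request; fail_open=False rejects it.
        return fail_open


def unavailable_limiter() -> bool:
    """Hypothetical stand-in for a limiter backend that is down."""
    raise ConnectionError("rate-limit service unavailable")
```

One plausible arrangement is a small bucket with `fail_open=False` on authentication endpoints and a larger bucket with `fail_open=True` on read endpoints, so a limiter outage does not open login to brute force.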
Identifying the Bottleneck
Every system has a current bottleneck. The skill is identifying it before the user does. Common bottlenecks in order of frequency:
Database connection pool exhaustion (because pool size < concurrent demand).
Single-shard hot key in Redis or Cassandra.
Synchronous external API call with no caching.
Disk I/O on a single node (Raft fsync, database writes).
Single-threaded code path in an otherwise concurrent service.
The diagnostic flow: load test until something gives. Where is CPU? Where is memory? Where is the queue depth growing? Where is latency climbing first? The answer points at the bottleneck.
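For the first bottleneck in the list, a rough estimate helps before any load test. The sketch below uses Little's Law (concurrent demand = arrival rate × time each request holds a connection), which is brought in here for illustration and is not taken from the list above. The function name and all parameters are assumptions.

```python
import math
from typing import Dict, Union


def connection_budget(replicas: int, pool_size: int, db_max_connections: int,
                      rps: int, db_time_ms: int) -> Dict[str, Union[int, bool]]:
    """Compare connection demand against pool supply and the database limit."""
    # Little's Law: average connections in use = arrival rate * hold time.
    demand = math.ceil(rps * db_time_ms / 1000)
    supply = replicas * pool_size
    return {
        "demand": demand,
        "supply": supply,
        # Requests queue for a connection when demand exceeds total pool size.
        "pool_exhausted": demand > supply,
        # More replicas can push the total past what the database accepts.
        "db_oversubscribed": supply > db_max_connections,
    }
```

At 2000 RPS with 50 ms of database time per request, demand is about 100 connections. Four replicas with pools of 10 supply only 40, so the pool is exhausted; twelve replicas supply 120, which clears the pool but oversubscribes a database capped at 100 connections.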
Cache Hierarchy in Practice
Distributed Rate Limiter Architecture
Self-Check Quiz
HPA scales on CPU. Your CPU is at 30%. Your service is throttled. What gives? (Answer: HPA is scaling on the wrong metric. Real bottleneck is probably connection pool, downstream RPC, or DB. Scale on the metric closest to user latency — RPS or p99 latency.)
Cache hit rate dropped from 95% to 60% overnight. What three things do you check? (Answer: recent deploy that changed key shape; eviction rate spike from memory pressure; downstream errors causing skipped writes.)
You add a CDN to a site already using Redis caching. Where do invalidations get hardest? (Answer: between layers. CDN may serve stale even after Redis is invalidated. Use surrogate keys or short TTLs at the CDN.)
Karpenter aggressively scales nodes down at night. The next morning, traffic spikes and pods take 3 minutes to schedule. What do you change? (Answer: warm pool / over-provisioning, or scale-down deferral. Karpenter is fast at scale-up but cold-start latency on a fresh node still bites.)
Move logic to the edge; lowest latency, lowest cost at scale.
Think like an engineer
Questions to answer before shipping
?Identify your bottleneck before scaling. Throwing replicas at a connection-pool problem makes it worse, not better.
?For every cache, define the freshness contract upfront. “5 minutes stale is fine” vs “must reflect the latest write” drives the entire invalidation strategy.
?Capacity planning is not a one-time exercise. Workload behaviour drifts; review monthly.
Key terms
Vocabulary used in this module
Stateless service
A service where any replica can serve any request without local state; foundation of horizontal scaling.
HPA
Kubernetes Horizontal Pod Autoscaler; scales replicas based on metrics.
Cluster Autoscaler / Karpenter
Kubernetes node autoscalers; add/remove nodes based on pending pods.
Thundering herd
Failure mode when many concurrent requests miss the cache and overwhelm the origin.
Cache stampede
Same as thundering herd; many concurrent recomputes of the same expired cache key.
Labs
Hands-on labs
90 minutes · Intermediate
Lab 6.1 — HPA on Custom Metrics
Configure HPA based on RPS or queue depth via Prometheus Adapter; observe scale-up under load.