Module 6 of 12

Scalability Engineering

Horizontal scaling, autoscaling, caching, CDNs, rate limiting - how production systems handle 10x and 100x traffic without 10x and 100x cost.

4 hours3 labsFree

Watch as Slides Course overview Lab code

Start here

Learning objectives

Design stateless services that scale horizontally without coordination
Pick the right caching strategy (cache-aside, write-through, write-back) for the workload
Configure Kubernetes HPA, VPA, and Cluster Autoscaler so they actually work
Implement distributed rate limiting that survives multi-region
Identify the scalability bottleneck before it becomes the outage

Before

Vertical scaling until the box maxes out; no headroom for growth
In-memory state on every replica; no horizontal scaling possible
Caches added reactively after the first outage; no invalidation strategy
HPA on CPU when bottleneck is connection pool; scaling helps until it doesn't

After

Stateless services with shared distributed cache + database; linear horizontal scale
Multi-layer cache hierarchy (CDN, in-process, distributed); each layer absorbs different load
Cache invalidation via CDC events; cache and database stay coherent
HPA on the metric closest to user latency; scaling responds to actual demand

Scalability is not adding more machines. Scalability is removing the contention points that prevent more machines from helping. Every system has a bottleneck; the question is whether the next 10x of load hits a bottleneck you have already moved or one that is still in the way.

Horizontal vs Vertical Scaling

Vertical scaling (bigger machines) hits hard limits and risks single points of failure. Horizontal scaling (more machines) is the path to real scale, but only works if your service is stateless or partitions correctly.

Stateless services are the foundation. Stateless means: any replica can serve any request. If you can swap one pod for another at any time without state migration, you can scale linearly. Common state-leaking patterns to avoid:

Local file caches that differ across replicas (move to Redis or shared filesystem).
Sticky sessions on the load balancer (use a session store like Redis instead).
In-process queues that hold work (move to Kafka/SQS).
Per-replica scheduled jobs (use a leader-elected singleton or distributed cron).

Caching as a Scaling Lever

Caching multiplies effective capacity. The Caching Strategies guide covers this in depth. The summary:

Cache-aside: app checks cache, falls back to DB, populates cache. The default for read-heavy workloads.
Write-through: writes go to cache + DB synchronously. The cache is always fresh; writes are slower.
Write-back: writes go to cache; cache flushes to DB asynchronously. Fast writes; data loss window.
Read-through: cache itself loads from DB on miss. Simpler app code; coupled cache and DB.

Multi-layer caching is the production reality: browser → CDN → L7 cache → in-process → Redis. Each layer has different invalidation cost and different blast radius.

CDN - Caching at the Edge

CDNs (Cloudflare, Fastly, Akamai, CloudFront) cache content at hundreds of edge POPs close to users. The contract between origin and edge is the Cache-Control header. public, max-age=3600, stale-while-revalidate=86400 tells the CDN: serve from cache for an hour, serve stale for a day while refreshing in the background.

Modern CDNs are also where you put: WAF, edge auth, geo routing, A/B test branching, and increasingly compute (Cloudflare Workers, Lambda@Edge). The edge is where the cheapest scaling lives.

Autoscaling on Kubernetes

Three layers of autoscaling, each independent:

HPA (Horizontal Pod Autoscaler): scale pod replicas based on CPU, memory, or custom metrics (RPS, latency, queue depth). Default scale-up is fast, scale-down conservative to avoid flapping.
VPA (Vertical Pod Autoscaler): rightsize resource requests over time. Useful for batch and unpredictable workloads; clashes with HPA on the same metrics.
Cluster Autoscaler / Karpenter: add nodes when pods cannot schedule due to resource shortage; remove underutilised nodes. Karpenter is the modern AWS-native replacement, faster and more flexible than Cluster Autoscaler.

The classic mistake: HPA on CPU when the bottleneck is connection pool, database, or downstream RPC. Always scale on the metric closest to user latency - often p99 latency or RPS, not CPU.

Distributed Rate Limiting

The Rate Limiting Algorithms guide covers token bucket, sliding window, distributed Redis-Lua patterns, and adaptive rate limiting. Three production rules:

Layer rate limits: CDN volumetric, gateway per-API-key, application per-user-action.
Choose fail-open or fail-closed deliberately when the rate-limit service is unavailable.
Authentication endpoints get stricter limits than read endpoints.

Identifying the Bottleneck

Every system has a current bottleneck. The skill is identifying it before the user does. Common bottlenecks in order of frequency:

Database connection pool exhaustion (because pool size < concurrent demand).
Single-shard hot key in Redis or Cassandra.
Synchronous external API call with no caching.
Disk I/O on a single node (Raft fsync, database writes).
Single-threaded code path in an otherwise concurrent service.

The diagnostic flow: load test until something gives. Where is CPU? Where is memory? Where is the queue depth growing? Where is latency climbing first? The answer points at the bottleneck.

Cache Hierarchy in Practice

Distributed Rate Limiter Architecture

Self-Check Quiz

HPA scales on CPU. Your CPU is at 30%. Your service is throttled. What gives? (Answer: HPA is scaling on the wrong metric. Real bottleneck is probably connection pool, downstream RPC, or DB. Scale on the metric closest to user latency - RPS or p99 latency.)
Cache hit rate dropped from 95% to 60% overnight. What three things do you check? (Answer: recent deploy that changed key shape; eviction rate spike from memory pressure; downstream errors causing skipped writes.)
You add a CDN to a site already using Redis caching. Where do invalidations get hardest? (Answer: between layers. CDN may serve stale even after Redis is invalidated. Use surrogate keys or short TTLs at the CDN.)
Karpenter aggressively scales nodes down at night. The next morning, traffic spikes and pods take 3 minutes to schedule. What do you change? (Answer: warm pool / over-provisioning, or scale-down deferral. Karpenter is fast at scale-up but cold-start latency on a fresh node still bites.)

For deeper caching patterns including invalidation flows and multi-region cache architecture, read the Caching Strategies guide. For rate-limiter implementation specifics see the Rate Limiting Algorithms guide. The Kubernetes cheatsheet covers HPA/VPA/Karpenter operational patterns.

Real world

Where this shows up

AWS DynamoDB's burst capacity is a literal token-bucket implementation visible to users.
Cloudflare absorbs trillions of requests per day at the edge with a layered cache that serves most reads before any origin is involved.
Stripe enforces per-API-key rate limits with token-bucket counters in Redis Lua scripts.
Netflix uses adaptive concurrency limits (open-source library) to dynamically size connection pools based on observed latency.

Production notes

Keep these close

Profile workloads BEFORE setting resource requests. Most workloads request 2-3x what they use; right-sizing is direct cost savings.
Scale on the metric closest to user latency, not CPU. CPU at 30% with throttled latency means CPU is not your bottleneck.
For Karpenter on AWS, set node consolidation to be aggressive but combined with PodDisruptionBudgets so the consolidation does not cause outages.

Common mistakes

What usually breaks

Setting HPA on CPU when the database connection pool is the actual bottleneck.
Caching everything by default; sometimes the database is fast enough and the cache is just extra failure surface.
Cluster Autoscaler with no Pod Disruption Budgets; nodes scale down and take working pods with them.

Security risks

Threats to watch

Cache poisoning via unkeyed headers (Host, Vary mishandling) lets one attacker affect many users.
Multi-tenant cache without tenant_id in the key leaks data between customers; a known SOC2 incident class.
Rate-limit bypass via X-Forwarded-For spoofing when the origin trusts the wrong header.
CDN-cached responses that should never be cached (auth-bearing, per-user) are a recurring breach pattern (web-cache deception).

Tradeoffs

Design choices you should be able to defend

HPA on CPU

Pros

Simple, default
Works well for CPU-bound workloads

Cons

Wrong signal for I/O-bound workloads
Lag between CPU spike and request latency

HPA on RPS / queue depth (custom metrics)

Pros

Scales on the actual load signal
Faster reaction

Cons

Requires Prometheus Adapter
More tuning

KEDA event-driven scaling

Pros

Scales on Kafka lag, queue depth, etc.
Scale to zero when idle

Cons

Extra component to operate
Cold-start tax on scale-up

Alternatives

Other production approaches

Kubernetes HPA + Karpenter

Cloud-native autoscaling stack; the default on AWS.

KEDA event-driven autoscaling

Scale on Kafka lag, queue depth, custom metrics; can scale to zero.

Cluster Autoscaler

Pre-Karpenter node autoscaler; still useful on GCP/Azure.

Redis Cluster + Cluster Mode Enabled

Standard distributed cache; sharded by hash slot.

CDN-first architecture (Cloudflare Workers, Lambda@Edge)

Move logic to the edge; lowest latency, lowest cost at scale.

Think like an engineer

Questions to answer before shipping

Identify your bottleneck before scaling. Throwing replicas at a connection-pool problem makes it worse, not better.
For every cache, define the freshness contract upfront. “5 minutes stale is fine” vs “must reflect the latest write” drives the entire invalidation strategy.
Capacity planning is not a one-time exercise. Workload behaviour drifts; review monthly.

Key terms

Vocabulary used in this module

Stateless service

A service where any replica can serve any request without local state; foundation of horizontal scaling.

HPA

Kubernetes Horizontal Pod Autoscaler; scales replicas based on metrics.

Cluster Autoscaler / Karpenter

Kubernetes node autoscalers; add/remove nodes based on pending pods.

Thundering herd

Failure mode when many concurrent requests miss the cache and overwhelm the origin.

Cache stampede

Same as thundering herd; many concurrent recomputes of the same expired cache key.

Labs

Hands-on labs

90 minutesIntermediate

Lab 6.1 - HPA on Custom Metrics

Configure HPA based on RPS or queue depth via Prometheus Adapter; observe scale-up under load.

Deploy app + Prometheus + Prometheus Adapter
Define HPA on RPS metric
Generate load; watch replicas scale up
Cool down; watch scale down

View lab on GitHub

60 minutesIntermediate

Lab 6.2 - Cache-Aside with Stampede Protection

Implement cache-aside with per-key locking to prevent thundering herd.

Implement naive cache-aside
Reproduce stampede on cache expiry
Add per-key Redis lock for recompute
Verify single recompute under load

View lab on GitHub

60 minutesAdvanced

Lab 6.3 - Distributed Rate Limiter (Redis Lua)

Implement an atomic token-bucket rate limiter as a Redis Lua script; load test it.

Write Lua script for atomic token bucket update
Hit it from many concurrent clients
Verify the rate is enforced globally

View lab on GitHub

Recap

Key takeaways

Stateless services are the foundation of horizontal scale - remove state-leaking patterns first
Multi-layer caching multiplies capacity; pick a strategy per layer deliberately
Scale on the metric closest to user latency, not on CPU when CPU is not the bottleneck
Distributed rate limiting requires consensus or aggregation - pick the trade-off
Every system has a bottleneck; the engineering work is moving it before it bites

Related resources

Scalability Engineering

Learning objectives

Horizontal vs Vertical Scaling

Caching as a Scaling Lever

CDN - Caching at the Edge

Autoscaling on Kubernetes

Distributed Rate Limiting

Identifying the Bottleneck

Cache Hierarchy in Practice

Distributed Rate Limiter Architecture

Self-Check Quiz

Where this shows up

Keep these close

What usually breaks

Threats to watch

Design choices you should be able to defend

HPA on CPU

HPA on RPS / queue depth (custom metrics)

KEDA event-driven scaling

Other production approaches

Kubernetes HPA + Karpenter

KEDA event-driven autoscaling

Cluster Autoscaler

Redis Cluster + Cluster Mode Enabled

CDN-first architecture (Cloudflare Workers, Lambda@Edge)

Questions to answer before shipping

Vocabulary used in this module

Stateless service

HPA

Cluster Autoscaler / Karpenter

Thundering herd

Cache stampede

Hands-on labs

Lab 6.1 - HPA on Custom Metrics

Lab 6.2 - Cache-Aside with Stampede Protection

Lab 6.3 - Distributed Rate Limiter (Redis Lua)

Key takeaways

Keep learning across CodersSecret

Related guides

Cheatsheets

Interactive labs

Glossary terms