Module 6 of 12

Scalability Engineering

Horizontal scaling, autoscaling, caching, CDNs, rate limiting — how production systems handle 10x and 100x traffic without 10x and 100x cost.

4 hours3 labsFree

Start here

Learning objectives

  • Design stateless services that scale horizontally without coordination
  • Pick the right caching strategy (cache-aside, write-through, write-back) for the workload
  • Configure Kubernetes HPA, VPA, and Cluster Autoscaler so they actually work
  • Implement distributed rate limiting that survives multi-region
  • Identify the scalability bottleneck before it becomes the outage

Before

  • Vertical scaling until the box maxes out; no headroom for growth
  • In-memory state on every replica; no horizontal scaling possible
  • Caches added reactively after the first outage; no invalidation strategy
  • HPA on CPU when bottleneck is connection pool; scaling helps until it doesn't

After

  • Stateless services with shared distributed cache + database; linear horizontal scale
  • Multi-layer cache hierarchy (CDN, in-process, distributed); each layer absorbs different load
  • Cache invalidation via CDC events; cache and database stay coherent
  • HPA on the metric closest to user latency; scaling responds to actual demand
SCALABILITY STACKCDN edgecached static + dynamic, geo-distributedLoad balancer + WAFL7 routing, rate-limit, edge authStateless service tierHPA scales pods on CPU / RPS / latencyDistributed cacheRedis Cluster, MemcachedAsync work queueKafka / SQS / RabbitMQSharded data storeCassandra, Dynamo, sharded PostgresEach layer absorbs a different class of load. Understanding which layer breaks first is the engineering skill.

Scalability is not adding more machines. Scalability is removing the contention points that prevent more machines from helping. Every system has a bottleneck; the question is whether the next 10x of load hits a bottleneck you have already moved or one that is still in the way.

Horizontal vs Vertical Scaling

Vertical scaling (bigger machines) hits hard limits and risks single points of failure. Horizontal scaling (more machines) is the path to real scale, but only works if your service is stateless or partitions correctly.

Stateless services are the foundation. Stateless means: any replica can serve any request. If you can swap one pod for another at any time without state migration, you can scale linearly. Common state-leaking patterns to avoid:

  • Local file caches that differ across replicas (move to Redis or shared filesystem).
  • Sticky sessions on the load balancer (use a session store like Redis instead).
  • In-process queues that hold work (move to Kafka/SQS).
  • Per-replica scheduled jobs (use a leader-elected singleton or distributed cron).

Caching as a Scaling Lever

Caching multiplies effective capacity. The Caching Strategies guide covers this in depth. The summary:

  • Cache-aside: app checks cache, falls back to DB, populates cache. The default for read-heavy workloads.
  • Write-through: writes go to cache + DB synchronously. The cache is always fresh; writes are slower.
  • Write-back: writes go to cache; cache flushes to DB asynchronously. Fast writes; data loss window.
  • Read-through: cache itself loads from DB on miss. Simpler app code; coupled cache and DB.

Multi-layer caching is the production reality: browser → CDN → L7 cache → in-process → Redis. Each layer has different invalidation cost and different blast radius.

CDN — Caching at the Edge

CDNs (Cloudflare, Fastly, Akamai, CloudFront) cache content at hundreds of edge POPs close to users. The contract between origin and edge is the Cache-Control header. public, max-age=3600, stale-while-revalidate=86400 tells the CDN: serve from cache for an hour, serve stale for a day while refreshing in the background.

Modern CDNs are also where you put: WAF, edge auth, geo routing, A/B test branching, and increasingly compute (Cloudflare Workers, Lambda@Edge). The edge is where the cheapest scaling lives.

Autoscaling on Kubernetes

Three layers of autoscaling, each independent:

  • HPA (Horizontal Pod Autoscaler): scale pod replicas based on CPU, memory, or custom metrics (RPS, latency, queue depth). Default scale-up is fast, scale-down conservative to avoid flapping.
  • VPA (Vertical Pod Autoscaler): rightsize resource requests over time. Useful for batch and unpredictable workloads; clashes with HPA on the same metrics.
  • Cluster Autoscaler / Karpenter: add nodes when pods cannot schedule due to resource shortage; remove underutilised nodes. Karpenter is the modern AWS-native replacement, faster and more flexible than Cluster Autoscaler.

The classic mistake: HPA on CPU when the bottleneck is connection pool, database, or downstream RPC. Always scale on the metric closest to user latency — often p99 latency or RPS, not CPU.

Distributed Rate Limiting

The Rate Limiting Algorithms guide covers token bucket, sliding window, distributed Redis-Lua patterns, and adaptive rate limiting. Three production rules:

  1. Layer rate limits: CDN volumetric, gateway per-API-key, application per-user-action.
  2. Choose fail-open or fail-closed deliberately when the rate-limit service is unavailable.
  3. Authentication endpoints get stricter limits than read endpoints.

Identifying the Bottleneck

Every system has a current bottleneck. The skill is identifying it before the user does. Common bottlenecks in order of frequency:

  • Database connection pool exhaustion (because pool size < concurrent demand).
  • Single-shard hot key in Redis or Cassandra.
  • Synchronous external API call with no caching.
  • Disk I/O on a single node (Raft fsync, database writes).
  • Single-threaded code path in an otherwise concurrent service.

The diagnostic flow: load test until something gives. Where is CPU? Where is memory? Where is the queue depth growing? Where is latency climbing first? The answer points at the bottleneck.

Cache Hierarchy in Practice

CACHE HIERARCHYBrowser~ns · Cache-Control headersCDN edge~10ms · Cloudflare/FastlyReverse proxy~5ms · Varnish/NGINXIn-process~µs · Caffeine/lru_cacheDistributed cache~1–3ms · Redis/Memcached

Distributed Rate Limiter Architecture

DISTRIBUTED RATE LIMITER (Redis Lua atomic)Gateway 1EVAL LuaEVAL LuaGateway 2EVAL LuaGateway NEVAL LuaRedis Clusterkey → slot → nodeatomic Lua = no race~1-2ms p99 / callOriginprotected fromclient abusePer-customer / per-API-key counters in Redis. Atomic Lua prevents race conditions across gateways.Centralised counter trades ~1ms latency for global enforcement.

Self-Check Quiz

  1. HPA scales on CPU. Your CPU is at 30%. Your service is throttled. What gives? (Answer: HPA is scaling on the wrong metric. Real bottleneck is probably connection pool, downstream RPC, or DB. Scale on the metric closest to user latency — RPS or p99 latency.)
  2. Cache hit rate dropped from 95% to 60% overnight. What three things do you check? (Answer: recent deploy that changed key shape; eviction rate spike from memory pressure; downstream errors causing skipped writes.)
  3. You add a CDN to a site already using Redis caching. Where do invalidations get hardest? (Answer: between layers. CDN may serve stale even after Redis is invalidated. Use surrogate keys or short TTLs at the CDN.)
  4. Karpenter aggressively scales nodes down at night. The next morning, traffic spikes and pods take 3 minutes to schedule. What do you change? (Answer: warm pool / over-provisioning, or scale-down deferral. Karpenter is fast at scale-up but cold-start latency on a fresh node still bites.)

For deeper caching patterns including invalidation flows and multi-region cache architecture, read the Caching Strategies guide. For rate-limiter implementation specifics see the Rate Limiting Algorithms guide. The Kubernetes cheatsheet covers HPA/VPA/Karpenter operational patterns.

Real world

Where this shows up

  • AWS DynamoDB&apos;s burst capacity is a literal token-bucket implementation visible to users.
  • Cloudflare absorbs trillions of requests per day at the edge with a layered cache that serves most reads before any origin is involved.
  • Stripe enforces per-API-key rate limits with token-bucket counters in Redis Lua scripts.
  • Netflix uses adaptive concurrency limits (open-source library) to dynamically size connection pools based on observed latency.

Production notes

Keep these close

  • Profile workloads BEFORE setting resource requests. Most workloads request 2-3x what they use; right-sizing is direct cost savings.
  • Scale on the metric closest to user latency, not CPU. CPU at 30% with throttled latency means CPU is not your bottleneck.
  • For Karpenter on AWS, set node consolidation to be aggressive but combined with PodDisruptionBudgets so the consolidation does not cause outages.

Common mistakes

What usually breaks

  • Setting HPA on CPU when the database connection pool is the actual bottleneck.
  • Caching everything by default; sometimes the database is fast enough and the cache is just extra failure surface.
  • Cluster Autoscaler with no Pod Disruption Budgets; nodes scale down and take working pods with them.

Security risks

Threats to watch

  • Cache poisoning via unkeyed headers (Host, Vary mishandling) lets one attacker affect many users.
  • Multi-tenant cache without tenant_id in the key leaks data between customers; a known SOC2 incident class.
  • Rate-limit bypass via X-Forwarded-For spoofing when the origin trusts the wrong header.
  • CDN-cached responses that should never be cached (auth-bearing, per-user) are a recurring breach pattern (web-cache deception).

Tradeoffs

Design choices you should be able to defend

HPA on CPU

Pros

  • Simple, default
  • Works well for CPU-bound workloads

Cons

  • Wrong signal for I/O-bound workloads
  • Lag between CPU spike and request latency

HPA on RPS / queue depth (custom metrics)

Pros

  • Scales on the actual load signal
  • Faster reaction

Cons

  • Requires Prometheus Adapter
  • More tuning

KEDA event-driven scaling

Pros

  • Scales on Kafka lag, queue depth, etc.
  • Scale to zero when idle

Cons

  • Extra component to operate
  • Cold-start tax on scale-up

Alternatives

Other production approaches

Kubernetes HPA + Karpenter

Cloud-native autoscaling stack; the default on AWS.

KEDA event-driven autoscaling

Scale on Kafka lag, queue depth, custom metrics; can scale to zero.

Cluster Autoscaler

Pre-Karpenter node autoscaler; still useful on GCP/Azure.

Redis Cluster + Cluster Mode Enabled

Standard distributed cache; sharded by hash slot.

CDN-first architecture (Cloudflare Workers, Lambda@Edge)

Move logic to the edge; lowest latency, lowest cost at scale.

Think like an engineer

Questions to answer before shipping

  • Identify your bottleneck before scaling. Throwing replicas at a connection-pool problem makes it worse, not better.
  • For every cache, define the freshness contract upfront. &ldquo;5 minutes stale is fine&rdquo; vs &ldquo;must reflect the latest write&rdquo; drives the entire invalidation strategy.
  • Capacity planning is not a one-time exercise. Workload behaviour drifts; review monthly.

Key terms

Vocabulary used in this module

Stateless service

A service where any replica can serve any request without local state; foundation of horizontal scaling.

HPA

Kubernetes Horizontal Pod Autoscaler; scales replicas based on metrics.

Cluster Autoscaler / Karpenter

Kubernetes node autoscalers; add/remove nodes based on pending pods.

Thundering herd

Failure mode when many concurrent requests miss the cache and overwhelm the origin.

Cache stampede

Same as thundering herd; many concurrent recomputes of the same expired cache key.

Labs

Hands-on labs

90 minutesIntermediate

Lab 6.1 — HPA on Custom Metrics

Configure HPA based on RPS or queue depth via Prometheus Adapter; observe scale-up under load.

  1. Deploy app + Prometheus + Prometheus Adapter
  2. Define HPA on RPS metric
  3. Generate load; watch replicas scale up
  4. Cool down; watch scale down
View lab on GitHub
60 minutesIntermediate

Lab 6.2 — Cache-Aside with Stampede Protection

Implement cache-aside with per-key locking to prevent thundering herd.

  1. Implement naive cache-aside
  2. Reproduce stampede on cache expiry
  3. Add per-key Redis lock for recompute
  4. Verify single recompute under load
View lab on GitHub
60 minutesAdvanced

Lab 6.3 — Distributed Rate Limiter (Redis Lua)

Implement an atomic token-bucket rate limiter as a Redis Lua script; load test it.

  1. Write Lua script for atomic token bucket update
  2. Hit it from many concurrent clients
  3. Verify the rate is enforced globally
View lab on GitHub

Recap

Key takeaways

  • Stateless services are the foundation of horizontal scale &mdash; remove state-leaking patterns first
  • Multi-layer caching multiplies capacity; pick a strategy per layer deliberately
  • Scale on the metric closest to user latency, not on CPU when CPU is not the bottleneck
  • Distributed rate limiting requires consensus or aggregation &mdash; pick the trade-off
  • Every system has a bottleneck; the engineering work is moving it before it bites

Related resources

Keep learning across CodersSecret