Every cache eventually causes an outage if you do not design it right. The reasons are always the same family: stale data when invalidation lags, thundering herds when the cache expires under load, hot keys that overwhelm a single Redis node, cascading failure when the cache itself goes down and the underlying database cannot serve the resulting load. Caching is the most leveraged performance tool in your stack and one of the easiest to get subtly wrong.

This guide is a production-oriented walk-through of how real systems cache: the access patterns (cache-aside, write-through, write-back, read-through), the topologies (application caches, distributed caches, CDN edges), the failure modes (thundering herd, cache stampede, hot partitions), and the operational decisions that determine whether your cache makes the system faster or just makes the next outage harder to debug.

Why Caching, and What Caching Actually Buys You

Caching is the deliberate trade of memory and complexity for latency and database load. When it works it returns a result in microseconds (RAM lookup) instead of milliseconds (database query), and it absorbs orders of magnitude more read traffic than your underlying datastore can handle. When it does not work it serves stale data, masks real problems, and produces incidents that are harder to root-cause than "the database is slow".

Three rules of thumb before adding any cache:

  1. Measure first. If the underlying query is fast enough, caching adds complexity for no gain. The premature cache is a real anti-pattern.
  2. Define the freshness contract. How stale is acceptable? Five seconds for a product page is fine; five seconds for a bank balance is not. The contract drives the invalidation strategy.
  3. Plan for the cache being down or empty. Can the system serve traffic with the cache cold or unavailable? If not, you have built a critical dependency, not a performance optimisation.

The Access Patterns

Cache-Aside (Lazy Loading)

The application checks the cache first. On a miss, it queries the database, writes the result into the cache, and returns it. The cache and database are independent; the application coordinates. The most common pattern in production for read-heavy workloads.

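import json

# `redis` and `db` are assumed to be pre-configured client objects.
def get_user(user_id):
    key = f"user:{user_id}"
    cached = redis.get(key)
    if cached is not None:                    # hit (an empty value still counts)
        return json.loads(cached)
    # Miss: load from the database, then populate the cache
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    redis.setex(key, 300, json.dumps(user))   # TTL = 5 min
    return user

def update_user(user_id, data):
    db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    redis.delete(f"user:{user_id}")           # invalidate; the next read repopulates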

Pros: simple. The cache only contains data that has been requested. Survives a cold cache or a missing entry — the worst case is a database query.

Cons: first request after a miss pays the database cost (latency penalty). Stale data possible if invalidation is lost or the TTL is too long. The thundering herd problem (covered below) shows up exactly here.

Read-Through

The cache itself loads data from the database on a miss. The application talks only to the cache; the cache decides when to fetch. Common with caching libraries that wrap the database (NCache, Caffeine with a CacheLoader, Spring Cache abstraction).

Pros: application code is simpler — one access path. Loading logic centralised. Cache implementation can deduplicate concurrent loads of the same key (the "cache loader stampede" defence).

Cons: the cache layer must know how to query your database, which couples them. Less common in microservices because the cache is rarely allowed to talk to your database directly.
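The load-deduplication idea can be sketched in a few lines. This is a minimal in-process illustration, not the API of any of the libraries named above: `ReadThroughCache` and its `loader` callback are hypothetical names, and the sketch omits TTLs and eviction.

```python
import threading
from typing import Any, Callable, Dict


class ReadThroughCache:
    """In-process read-through cache with per-key load deduplication."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader            # hypothetical: fetches a value from the backing store
        self._data: Dict[str, Any] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()   # protects the two dicts above

    def get(self, key: str) -> Any:
        with self._guard:
            if key in self._data:
                return self._data[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        # Only one thread per key runs the loader; the others block here.
        with key_lock:
            with self._guard:
                if key in self._data:    # filled while we were waiting
                    return self._data[key]
            value = self._loader(key)
            with self._guard:
                self._data[key] = value
            return value
```

Callers only ever call `get`; eight threads asking for the same cold key trigger a single loader call.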

Write-Through

Every write goes to both the cache and the database synchronously. The cache is always consistent with the database (modulo the brief window between the two writes).

def update_user(user_id, data):
    db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    redis.setex(f"user:{user_id}", 300, json.dumps(data))

Pros: never stale (within the write-completion window). Reads never miss for recently-written data.

Cons: writes are slower (two systems on the critical path). Wasted cache writes for data that is rarely or never read — you populate the cache for every write, not just for reads. Use only when most written data will be read soon.

Write-Back (Write-Behind)

Writes go to the cache only. The cache asynchronously flushes dirty entries to the database in batches. The database catches up later.

Pros: very fast writes — the database is never on the critical path. Naturally batches multiple updates to the same key. High write throughput.

Cons: data loss window if the cache crashes before flushing. Hard to reason about consistency. Only used when write throughput dominates and some data loss is acceptable (analytics counters, view counts, leaderboards).
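A rough single-threaded sketch of the write-back idea follows. `WriteBackCache` and `flush_fn` are hypothetical names; a real deployment would call `flush` from a timer or background worker, and anything still dirty when the process dies is lost.

```python
from typing import Any, Callable, Dict, Set


class WriteBackCache:
    """Writes land in memory; dirty keys are flushed to the database in batches."""

    def __init__(self, flush_fn: Callable[[Dict[str, Any]], None]):
        self._flush_fn = flush_fn          # hypothetical: persists one batch to the database
        self._data: Dict[str, Any] = {}
        self._dirty: Set[str] = set()

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value            # fast path: no database call
        self._dirty.add(key)

    def get(self, key: str) -> Any:
        return self._data.get(key)

    def flush(self) -> int:
        """Write the latest value of every dirty key; return the batch size."""
        if not self._dirty:
            return 0
        batch = {k: self._data[k] for k in self._dirty}
        self._flush_fn(batch)              # if this raises, the keys stay dirty for a retry
        self._dirty.clear()
        return len(batch)
```

Several updates to the same key between flushes collapse into one database write, which is where the batching benefit comes from.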

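[Figure: Cache access patterns. Cache-aside (read): the app GETs from the cache, on a miss SELECTs from the database, and receives the row. Write-through: the app SETs the cache, and the write goes through to the database. Write-back: the app SETs the cache, which flushes to the database in async batches. Read-through: the app GETs from the cache, which loads from the database on a miss.]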

Multi-Layer Cache Hierarchy

Real systems cache at multiple layers. A request for a product page might hit:

  1. Browser cache (Cache-Control headers, service worker): no network round trip, so effectively instant to the user.
  2. CDN edge cache (Cloudflare, Fastly, CloudFront) — ~10ms within the same continent.
  3. Origin reverse proxy cache (Varnish, NGINX) — ~5ms within the same datacentre.
  4. Application in-process cache (Caffeine, Guava, Python functools.lru_cache) — microseconds.
  5. Distributed cache (Redis, Memcached) — ~1–3ms within the same VPC.
  6. Database query cache / page cache — ~1ms for in-memory pages.
  7. The actual storage — tens of milliseconds for SSD, hundreds for cold storage.

Each layer has a different invalidation cost, a different consistency story, and a different blast radius if it fails. The hierarchy is intentional: the closer a layer sits to the user, the cheaper the hit and the harder the invalidation.

[Figure: Cache hierarchy, fastest at the top and most authoritative at the bottom: browser cache (Cache-Control / service worker), CDN edge (geo-distributed), reverse proxy (per datacentre), in-process cache (per pod), distributed cache (cluster-wide), database / origin storage.]

Distributed Cache Topologies

Redis: Single-Node, Sentinel, Cluster

Redis is the de facto standard distributed cache because of its rich data structures (not just key/value — sorted sets, hashes, streams, HyperLogLog) and its operational maturity. Three topologies dominate:

  • Single-node with persistence (RDB snapshots + AOF). Simplest, no HA. Acceptable for non-critical caches; restart pauses are real.
  • Sentinel: a primary with one or more replicas, plus Sentinel processes that coordinate failover. Strong-ish HA — Sentinel orchestrates leader election among the Sentinels themselves and promotes a replica when the primary fails. The classic warning is split-brain during a partition: see the consensus discussion in the Distributed Systems Algorithms guide.
  • Cluster: 16384 hash slots partitioned across N primary nodes, with replicas per primary. Linear scalability for both memory and throughput. Rebalancing happens online via slot migration. Most large Redis deployments converge on Cluster mode, often with cloud-managed offerings like AWS ElastiCache or Memorystore.
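The key-to-slot mapping is small enough to sketch directly. Redis Cluster hashes keys with CRC16 (the XMODEM variant, which Python's `binascii.crc_hqx` computes when seeded with 0) modulo 16384, and if the key contains a non-empty `{...}` hash tag only the tag is hashed. The function name `key_slot` is an illustrative choice.

```python
from binascii import crc_hqx

HASH_SLOTS = 16384


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot for a key."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Hash only the tag when the braces enclose at least one byte.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return crc_hqx(data, 0) % HASH_SLOTS
```

Keys that share a hash tag, such as `{user1000}.following` and `{user1000}.followers`, land in the same slot, which is what makes multi-key operations on them possible in Cluster mode.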

Memcached

Memcached is the simpler counterpoint to Redis. Pure key/value, no persistence, no replication, no data structures — just a sharded LRU cache. Its strength is operational simplicity (Facebook famously runs many trillions of ops/day on Memcached) and predictable performance.

Memcached uses client-side consistent hashing for sharding. The libmemcached client is the de facto C client; Mcrouter (open-sourced by Facebook) is a proxy that adds connection pooling, replication, and pool management on top of plain Memcached.
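Client-side consistent hashing can be sketched as a sorted ring of hashed virtual nodes. This is an illustration of the general technique rather than libmemcached's exact algorithm; `HashRing`, the MD5-based hash, and the 100 virtual nodes per server are assumptions made for the example.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(value: str) -> int:
    # First 8 bytes of MD5 as an integer; any well-distributed hash would do.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: Iterable[str], vnodes: int = 100):
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []   # sorted (point, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each server owns many points so load spreads evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

The property that matters operationally: adding a fourth server to a three-server ring moves only a fraction of the keys, and every key that moves goes to the new server.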

Choose Memcached when you only need cache, you have huge fleets, and you value operational simplicity. Choose Redis when you need data structures, persistence, or pub/sub.

[Figure: Distributed cache topology (Redis Cluster). A cluster-aware client in the application pod computes CRC16(key) % 16384 to find the slot and its primary. Primary 1 owns slots 0–5460, Primary 2 owns 5461–10922, and Primary 3 owns 10923–16383 (about 33% of the keyspace each); each primary replicates to a read-only, promotable replica. Primaries gossip slot ownership and node liveness state; on failover a replica is promoted by majority consensus.]

CDN Edge Caching

CDN caching is a distributed cache the size of the internet. Cloudflare, Fastly, Akamai, CloudFront each maintain hundreds of edge POPs caching content close to end users. The CDN behaves like a giant reverse-proxy cache, keyed by URL and modulated by request headers (Vary).

The contract between origin and edge is the cache headers. Cache-Control: public, max-age=3600, stale-while-revalidate=86400 tells the CDN: serve from cache for an hour, serve stale for up to a day while you fetch a fresh version in the background. Combined with surrogate keys (Fastly's primary differentiator), origins can purge collections of related cache entries by tag rather than by URL.
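To make the header semantics concrete, here is a small sketch that parses the directives and classifies a cached response by age. The function names are hypothetical, and the parser handles only the simple `name` and `name=integer` forms used in the example above.

```python
def parse_cache_control(header: str) -> dict:
    """Parse simple Cache-Control directives into a dict."""
    directives = {}
    for part in header.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = int(value) if value.isdigit() else True
    return directives


def cache_state(age_s: float, max_age: int, stale_while_revalidate: int) -> str:
    """Classify a cached response by its age in seconds."""
    if age_s < max_age:
        return "fresh"                 # serve from cache
    if age_s < max_age + stale_while_revalidate:
        return "stale-revalidate"      # serve stale, refresh in the background
    return "expired"                   # must go to origin
```

With `max-age=3600` and `stale-while-revalidate=86400`, a response that is 4000 seconds old is served stale while the edge refetches it.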

Cache Invalidation

“There are only two hard things in Computer Science: cache invalidation and naming things.”

— Phil Karlton

Invalidation is hard because you have to maintain a relationship between two systems — the cache and the source of truth — and the moment that relationship lags, you serve stale data. The strategies, ordered from simplest to most complex:

TTL Expiry

Every entry has a time-to-live; entries are evicted when they expire. Simplest strategy. Always-correct-eventually. The trade-off is freshness vs database load: short TTL = fresh but more misses; long TTL = stale but fewer misses.

The right TTL depends on the data's churn rate. Product page (changes daily): TTL = 1 hour. User profile (changes monthly): TTL = 24 hours. Geographic IP database (changes weekly): TTL = 6 hours. Configuration (changes minutes after a deploy): TTL = 30 seconds plus push-based bust.

Explicit Invalidation on Write

The application invalidates the cache when it writes to the database. The simplest version is DEL on each write; a correctness-first version uses a transactional outbox so that the invalidation is eventually delivered even if the database write succeeds and the cache call fails.

The classic anti-pattern: write to database, then write to cache (rather than DEL). If two writes race, you can end up with the older value cached. Always invalidate (DEL); let the next read repopulate.
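The race is easier to see with the interleaving written out. The sketch below replays one unlucky ordering deterministically, with plain dicts standing in for the database and the cache; the function names are illustrative.

```python
def race_with_set():
    """Two writers update the row, then SET the cache in the opposite order."""
    db, cache = {}, {}
    db["user:1"] = "A"        # writer A commits
    db["user:1"] = "B"        # writer B commits; the database ends at B
    cache["user:1"] = "B"     # B's cache SET happens to arrive first
    cache["user:1"] = "A"     # A's delayed SET overwrites it with the older value
    return db["user:1"], cache["user:1"]


def race_with_del():
    """Same interleaving, but both writers DEL instead of SET."""
    db, cache = {}, {"user:1": "old"}
    db["user:1"] = "A"
    db["user:1"] = "B"
    cache.pop("user:1", None)     # B's DEL
    cache.pop("user:1", None)     # A's delayed DEL is harmless
    # The next read misses and repopulates from the database.
    cache.setdefault("user:1", db["user:1"])
    return db["user:1"], cache["user:1"]
```

`race_with_set()` leaves the database at "B" and the cache at "A"; `race_with_del()` leaves both at "B". DEL narrows the window rather than closing it completely (a reader that fetched the old row just before the write can still repopulate the cache after the DEL), which is why a TTL remains a useful safety net.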

Event-Driven Invalidation (CDC)

Change Data Capture (CDC) tools (Debezium, AWS DMS, Maxwell's Daemon) tail the database's transaction log (the MySQL binlog, the PostgreSQL WAL) and emit events on every write. A small consumer translates events into cache invalidation calls. The application code does not have to remember to invalidate; the database itself is the source of invalidation events.

This pattern is the dominant invalidation strategy at large scale. LinkedIn, Netflix, and Slack all use CDC-driven cache invalidation. The trade-off is operational complexity (now you operate Kafka + a CDC pipeline) and event-handling latency (cache may be stale for a few hundred milliseconds after a database write).

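Write-Through (Cache Updated on Write)

Already covered above. The cache is updated synchronously with every database write, so there is no separate invalidation step. The two writes are not atomic, though: concurrent writers can still leave the older value cached, as described under Explicit Invalidation on Write.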

Surrogate Keys / Tag-Based Invalidation

For CDN caches, you tag cache entries with one or more surrogate keys (e.g. product:123, category:electronics). When the underlying data changes, you purge all entries tagged with the relevant key with a single API call. Fastly built this into their core; AWS CloudFront invalidation, by contrast, is path-based, with wildcard patterns rather than tags.
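The bookkeeping behind tag-based purging is a reverse index from tag to keys. The sketch below shows the idea in process; `TaggedCache` is a hypothetical class and does not model any CDN's actual API.

```python
from collections import defaultdict


class TaggedCache:
    def __init__(self):
        self._data = {}
        self._keys_by_tag = defaultdict(set)   # tag -> keys carrying that tag

    def set(self, key, value, tags=()):
        self._data[key] = value
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key):
        return self._data.get(key)

    def purge_tag(self, tag) -> int:
        """Drop every entry carrying the tag; return how many keys were indexed under it."""
        keys = self._keys_by_tag.pop(tag, set())
        for key in keys:
            self._data.pop(key, None)
        return len(keys)
```

Purging `product:123` removes the product page and any listing page tagged with it, while pages tagged only with `category:electronics` stay cached.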

Versioned Cache Keys

Bake a version into the cache key (user:123:v42). To invalidate, increment the version — old keys remain in cache but are never read again, and they expire naturally via TTL. Useful when you cannot reliably enumerate all entries to invalidate (e.g. precomputed search results).
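A minimal sketch of the versioning scheme, assuming the current version number lives in the cache itself under a hypothetical `ver:<namespace>` key. A plain dict stands in for the shared cache.

```python
class VersionedKeys:
    def __init__(self, store: dict):
        self._store = store   # stands in for the shared cache

    def _version(self, namespace: str) -> int:
        return self._store.get(f"ver:{namespace}", 1)

    def key(self, namespace: str, suffix: str) -> str:
        """Build a cache key that embeds the namespace's current version."""
        return f"{namespace}:{suffix}:v{self._version(namespace)}"

    def invalidate(self, namespace: str) -> None:
        """Bump the version; older keys are orphaned and left to expire via TTL."""
        self._store[f"ver:{namespace}"] = self._version(namespace) + 1
```

After `invalidate("search")`, every reader builds `...:v2` keys and misses, without anyone having to enumerate the old `...:v1` entries. In a shared cache the bump would need to be an atomic increment.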

[Figure: CDC-based cache invalidation flow. The app issues UPDATE users SET ...; the database commits and writes its binlog / WAL; Debezium or DMS tails the log and publishes to a Kafka topic (user.changes); an invalidator service consumes the events and issues DEL, evicting user:123 from the distributed cache; the next read for user:123 is a miss, refetches from the database, and repopulates. Total invalidation latency: typically 100–500ms after the DB commit.]

The Thundering Herd and Cache Stampede

The most common production cache failure: an entry expires under load. Many concurrent requests miss the cache simultaneously. They all query the database. The database falls over. A request that was supposed to be O(1) cache becomes O(N) database thrash.

Three defences:

1. Per-Key Locking (Lock + Recompute)

On miss, take a short-lived lock on the key. Whichever request gets the lock fetches and populates; others wait briefly and re-read the cache. Implementation:

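import json
import time

# `redis` and `db` are assumed to be pre-configured client objects.
def get_user_with_lock(user_id, max_attempts=40):
    key = f"user:{user_id}"
    lock_key = f"lock:{key}"
    for _ in range(max_attempts):
        cached = redis.get(key)
        if cached is not None:
            return json.loads(cached)
        # Try to acquire the lock; it expires after 10 s if the holder crashes
        if redis.set(lock_key, "1", nx=True, ex=10):
            try:
                user = db.query("SELECT * FROM users WHERE id = ?", user_id)
                redis.setex(key, 300, json.dumps(user))
                return user
            finally:
                redis.delete(lock_key)
        time.sleep(0.05)                      # lost the race: wait, then re-check
    # Waited about 2 s without a result: fall back to the database
    return db.query("SELECT * FROM users WHERE id = ?", user_id)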

Effective but adds latency for the losers. A failed lock-holder leaves the lock orphaned for the lock TTL — that is the worst-case latency penalty.

2. Probabilistic Early Expiration

Each request, with small probability, treats a still-valid entry as if it had expired and refreshes it asynchronously. The probability rises as the entry approaches its actual expiration. By the time the entry would have expired, it has already been preemptively refreshed by some lucky request.

This is the "XFetch" algorithm (Vattani, Chierichetti, Lowenstein, 2015). Beautifully avoids the synchronisation problem because no request is forced to wait.
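The decision rule from the paper fits in one function: refresh early when `now - delta * beta * ln(rand)` reaches the expiry time, where `delta` is how long the last recomputation took and `beta` (default 1) tunes how eagerly to refresh. The function name and the injectable `now` and `rand` parameters are conveniences added here for testing.

```python
import math
import random
import time
from typing import Optional


def should_refresh_early(expiry: float, delta: float, beta: float = 1.0,
                         now: Optional[float] = None,
                         rand: Optional[float] = None) -> bool:
    """XFetch check: True if this request should recompute the entry now.

    expiry: absolute expiry time of the cached entry (seconds)
    delta:  how long the last recomputation took (seconds)
    beta:   values above 1 favour earlier refreshes, below 1 later ones
    """
    now = time.time() if now is None else now
    # 1 - random() lies in (0, 1], so log() never sees zero.
    rand = 1.0 - random.random() if rand is None else rand
    # log(rand) <= 0, so the subtraction pushes "now" forward by a random amount.
    return now - delta * beta * math.log(rand) >= expiry
```

Far from expiry the random push almost never reaches `expiry`; in the last few multiples of `delta` before expiry it frequently does, so the refresh happens before the herd arrives.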

3. Stale-While-Revalidate

Borrowed from HTTP cache headers (Cache-Control: stale-while-revalidate=N). When an entry expires, return the stale value immediately and refresh asynchronously; a true miss still has to wait for the origin. Acceptable when small amounts of staleness are fine. Cloudflare and Fastly support it natively at the CDN layer.
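The same idea applies inside an application cache. The sketch below keeps a fresh window (`ttl`) and a stale window (`stale_ttl`) per entry; all names are hypothetical, the `submit` hook stands in for whatever background executor is available, and the class is not hardened for heavy concurrency.

```python
import threading
import time


class SWRCache:
    def __init__(self, loader, ttl, stale_ttl, submit=None, clock=time.monotonic):
        self._loader = loader          # hypothetical: fetches a fresh value from the origin
        self._ttl = ttl                # seconds an entry counts as fresh
        self._stale_ttl = stale_ttl    # extra seconds it may be served stale
        # By default, refresh on a daemon thread; tests can inject their own scheduler.
        self._submit = submit or (
            lambda task: threading.Thread(target=task, daemon=True).start())
        self._clock = clock
        self._data = {}                # key -> (value, stored_at)
        self._refreshing = set()       # keys with a refresh already scheduled

    def _load(self, key):
        try:
            value = self._loader(key)
            self._data[key] = (value, self._clock())
            return value
        finally:
            self._refreshing.discard(key)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return self._load(key)                 # true miss: must wait
        value, stored_at = entry
        age = self._clock() - stored_at
        if age < self._ttl:
            return value                           # fresh
        if age < self._ttl + self._stale_ttl:
            if key not in self._refreshing:        # schedule at most one refresh
                self._refreshing.add(key)
                self._submit(lambda: self._load(key))
            return value                           # serve stale immediately
        return self._load(key)                     # too old: reload synchronously
```

Only the first request after expiry schedules a refresh; everyone else keeps getting the stale value until the new one lands.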

[Figure: Cache stampede and its defences. Unprotected: N requests hit an expired entry, send N queries, and the database melts. Protected (lock + recompute): N requests contend for a SETNX lock, one query reaches the database, and the database stays healthy.]

Hot Keys and Hot Partitions

In a sharded cache (Redis Cluster, Memcached with consistent hashing, DynamoDB), every key maps to a single partition. If 90% of your traffic targets one key (the homepage product, the celebrity user's feed), that key's partition becomes a hotspot — saturating one node while others sit idle.

Detection: per-partition request rate metrics. Cassandra exposes per-token-range metrics; Redis Cluster exposes per-node QPS. A 10x difference between the busiest and average node is a clear hot-key signal.

Mitigation:

  • Local cache as a shield: each application pod caches the hot key in process for 1–2 seconds, fronting the distributed cache. Requests to the hot key never leave the pod.
  • Key splitting: instead of product:123, write to product:123:shard1...product:123:shardN and have readers pick a random shard. The hot key becomes N less-hot keys spread across partitions.
  • Cache the precomputed answer: if the hot key feeds 10 different views, compute all 10 once and cache them — avoid recomputation on every read.
  • CDN it: if it is GET-able, push it to the CDN with a short TTL. The hot key becomes the CDN's problem, which is built for it.
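Key splitting is simple to sketch. Here a dict stands in for the sharded cache, and the helper names and the default of 8 shards are assumptions for illustration.

```python
import random


def shard_keys(base_key: str, shards: int) -> list:
    """All shard keys for a hot key: base:shard1 .. base:shardN."""
    return [f"{base_key}:shard{i}" for i in range(1, shards + 1)]


def write_hot_key(cache: dict, base_key: str, value, shards: int = 8) -> None:
    # Writers pay N writes so that reads can spread across partitions.
    for key in shard_keys(base_key, shards):
        cache[key] = value


def read_hot_key(cache: dict, base_key: str, shards: int = 8, rng=random):
    # Each reader picks one shard at random.
    return cache.get(f"{base_key}:shard{rng.randint(1, shards)}")
```

The cost is write amplification and N keys to invalidate, so this fits keys that are read far more often than they are written.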

Eviction Policies

Caches have finite memory. When full, an eviction policy decides what to discard. Common choices:

  • LRU (Least Recently Used): evict the entry that has not been accessed for the longest time. The default in most caches. Approximated in Redis (full LRU is too expensive at scale; Redis samples a small subset and evicts the LRU among them).
  • LFU (Least Frequently Used): evict the least-accessed entry over a window. Better than LRU when access patterns have stable hot/cold distinction (e.g. evergreen vs viral content). Redis 4+ supports approximated LFU.
  • FIFO: evict in insertion order. Simple, rarely the best choice.
  • TTL-only (volatile-ttl in Redis): evict the entry with the soonest expiry first. Useful when you want strict TTL behaviour.
  • Random: evict at random. Surprisingly competitive with LRU in some workloads, much cheaper to implement.

Redis exposes the choice as maxmemory-policy: allkeys-lru, allkeys-lfu, volatile-lru, volatile-lfu, volatile-ttl, volatile-random, allkeys-random, noeviction. The noeviction setting refuses writes when full — useful for cache-as-truth use cases (queues, session stores) where data loss is unacceptable.
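For reference, an exact LRU (as opposed to Redis's sampled approximation) takes only a few lines in process. This sketch uses the standard library's `OrderedDict`; the class name is illustrative.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        self._capacity = capacity
        self._data = OrderedDict()      # oldest access first, newest last

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)     # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)   # evict the least recently used entry
```

With capacity 2, setting `a` and `b`, reading `a`, then setting `c` evicts `b`, because `a` was touched more recently.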

Multi-Region Caching

For globally distributed services, you cache regionally to keep latency low. The architecture choices:

  • Independent regional caches: each region has its own Redis cluster; invalidation events are fanned out via a regional pub/sub or via the CDC pipeline. Simple, eventually consistent across regions.
  • Active-active with conflict resolution: writes accepted in any region; replicas converge via CRDTs or LWW. Used by DynamoDB Global Tables.
  • Active-passive with read replicas: a single primary region, read replicas in other regions. Writes pay cross-region latency; reads are local. Used by many cache-as-database deployments.

The architectural decision tracks the CAP discussion: you cannot have global strong consistency and local low latency simultaneously.

[Figure: Multi-region cache architecture. us-east: app fleet, Redis Cluster, and the DB primary (writes and reads local). eu-west: app fleet, Redis Cluster, and a DB read replica (reads local; writes routed to the home region). A cross-region pipeline (Debezium CDC events, Kafka MirrorMaker, one invalidator per region) propagates invalidations with roughly 100–500ms of eventual-consistency lag. Each region reads locally and invalidates locally when cross-region writes arrive through CDC, so regions are eventually consistent with each other while per-region latency stays low.]

Database-Layer Caching

Modern databases include their own caching layers, and the application cache often sits on top of these.

  • PostgreSQL shared_buffers: the database keeps recently-read pages in memory. Tuned via shared_buffers (typical 25% of RAM) and aided by the OS page cache.
  • MySQL InnoDB buffer pool: same idea. innodb_buffer_pool_size typically 50–75% of RAM.
  • MongoDB WiredTiger cache: in-memory cache of working set.
  • DynamoDB DAX: an opt-in in-memory cache that fronts DynamoDB transparently for read latency.

Sometimes the database's own cache is enough — if your working set fits in shared_buffers, you may not need a separate Redis at all. Always start with database tuning before introducing an external cache.

Kubernetes Caching Patterns

Kubernetes-specific caching patterns:

  • Sidecar cache: a Redis or Memcached container in the pod. Lowest latency (loopback), but each pod has its own cache, so multi-pod deployments duplicate entries and pods can hold different contents. Useful for read-heavy single-tenant services.
  • StatefulSet cache: a dedicated Redis StatefulSet per cluster, accessed via Service DNS. The standard pattern for shared application caches.
  • External managed cache: AWS ElastiCache / GCP Memorystore. The right answer for production at any scale — outsource the operational burden.
  • HTTP cache via ingress: NGINX Ingress with proxy_cache directives, or a dedicated Varnish layer. Caches at the edge of the cluster.

Security Considerations

Caches multiply the blast radius of three security failure modes:

  1. Cross-tenant data leakage: a cache key that does not include tenant scope can serve one tenant's data to another. Always include tenant_id in the key.
  2. Cache poisoning: an attacker tricks the cache into storing malicious content. Most common with HTTP caches and unkeyed headers (e.g. caching based on Host header that the attacker controls). Use the Vary header carefully and validate cache keys against a whitelist.
  3. Sensitive data in cache: PII, secrets, tokens cached without thought. Treat the cache as a separate datastore for compliance purposes — encryption at rest, access control, audit logs.

The classic vulnerability is web cache deception (Omer Gil, 2017): an attacker requests example.com/account/profile.css — the CDN sees the .css extension and caches the response, but the application ignores the extension and serves the user's authenticated profile. Now the next request for the same path serves that profile to anyone. Cache only what the application explicitly marks as cacheable.

For the broader API security picture, see the API Attack & Defense Simulator for hands-on practice and the Cloud Native Security Engineering course for the systematic walk-through.

Observability

Cache metrics that tell you whether the cache is helping or hurting:

  • Hit rate: hits / (hits + misses). Below ~80% suggests the cache is not earning its keep; above ~99.5% suggests TTLs may be too long or the cache is oversized.
  • Eviction rate: how often entries are evicted before TTL expiry. High evictions mean memory pressure; either grow the cache or reduce the cached dataset.
  • Cache latency p50/p99: should be in the low single-digit milliseconds or better for in-region Redis. p99 spikes indicate hot keys, client-side GC pauses, or network problems.
  • Origin load with cache vs without: derived metric showing the cache's actual value. Useful for capacity planning and for justifying the cache's existence.
  • Stampede signals: simultaneous miss spikes for the same key, or duplicate origin queries within a short window.
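The first two metrics need nothing more than counters around the cache calls. A minimal sketch, with `CacheStats` as a hypothetical name; in practice the counters would be exported to whatever metrics system is already in place.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    evictions: int = 0

    def record(self, hit: bool) -> None:
        """Count one lookup as a hit or a miss."""
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        # hits / (hits + misses); 0.0 before any lookups have been recorded.
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Calling `record` after every lookup and `evictions += 1` in the eviction path is enough to chart hit rate and eviction rate over time.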

Frequently Asked Questions

Should I use Redis or Memcached?

Redis if you need data structures (sorted sets for leaderboards, streams for queues, hashes for object-style records), persistence, or replication. Memcached if you want pure key/value, operational simplicity, and predictable performance at huge scale. Most teams pick Redis by default and never look back.

How long should my TTL be?

The shortest TTL that keeps your hit rate acceptable. Start with 5 minutes for most user-facing reads, 1 hour for catalog/reference data, 24 hours for slow-moving content (geo databases, ML models). Measure hit rate; adjust.

Should I cache the request or the response?

Cache at the boundary that gives the most reuse. Caching the database query result is reusable across many endpoints; caching the rendered HTML is faster but only reusable per URL. Both are valid; many systems do both at different layers.

How do I avoid the “cache and database disagree” bug?

Use DEL on writes (not SET); use a CDC pipeline for cross-service invalidation; use short TTLs as a safety net so stale entries expire on their own; treat the database as authoritative and the cache as ephemeral.

Is in-process caching ever worth it on top of Redis?

Yes — for hot keys it can shave the 1ms Redis call to a few microseconds and dramatically reduce Redis load. Use a small TTL (1–5 seconds) so staleness is bounded, and keep the cache size small (Caffeine with size limit) to avoid memory pressure.

What about caching at the edge with workers (Cloudflare Workers, Lambda@Edge)?

Excellent fit for content that varies by region, country, or device class but does not need per-user customisation. Edge workers can compose responses from cached fragments and origin calls, often achieving 95%+ hit rates with sub-50ms latency globally.

Conclusion

Caching is the highest-leverage performance tool in your stack and one of the easiest to get subtly wrong. Every cache eventually causes an outage if you do not design for the failure modes — stale data, thundering herds, hot keys, cascading failure when the cache itself goes down. The systems that get it right are the ones that started with the failure modes in mind.

The high-leverage takeaways:

  • Measure first: do not cache what is already fast enough.
  • Define the freshness contract before choosing the strategy.
  • Cache-aside is the default; write-through only when most writes will be read soon; write-back only when data loss windows are acceptable.
  • DEL on writes, never SET.
  • Multi-layer beats single-layer: CDN at the edge, distributed cache for shared application state, in-process for hot keys.
  • Defend against thundering herd before it bites you, not after.
  • Treat the cache as a tier with its own SLOs, observability, and security posture.

The cache that does not emit hit-rate, eviction, latency, and stampede metrics is just guessing about whether it is helping.

Where to Go Next