Module 11: Real-World Failure Scenarios
Retry storms, cache stampedes, split brain, hot partitions, queue overload, DNS outages, service-discovery failures, cascading failures — the incidents that actually happen, and how to engineer them away.
5 hours. 3 hands-on labs. Free course module.
Learning Objectives
- Recognise the canonical distributed-systems failure modes by their telemetry signatures
- Reproduce each failure in a controlled lab so the pattern is in your hands
- Apply the architectural defences that make each failure hard or impossible
- Write incident runbooks that an on-call engineer can actually use at 3am
- Run a post-incident review that produces lasting improvements
Why This Matters
Real production engineers are recognised by the incidents they have absorbed and the runbooks they own. The patterns in this module — retry storms, stampedes, split brain, hot partitions, queue overload, DNS outages, cascading failure — are the same outage taxonomy across every company at every scale. Engineers who internalise them respond in minutes; engineers who do not spend hours reconstructing what should have been recognised in the first thirty seconds.
Lesson Content
Every distributed system fails the same way the others do. The taxonomy is small and the failure modes are well-documented, which means the patterns that defend against them are equally well-understood. The engineers who experience these incidents and build the defences are the ones the rest of the org calls when something is on fire.
This module walks the canonical incidents one at a time: what they look like, what causes them, how to defend.
Retry Storms
Symptom: a downstream service has a brownout. Errors trigger client retries. Retries multiply load on the failing service. The service cannot recover. p99 latency stays high; error rate stays high; you cannot get out of it without a deploy or restart.
Defence: give every retry policy a budget (cap retries at, say, 10% of total RPS). Use exponential backoff with jitter so clients do not retry in lockstep. Add circuit breakers that stop calls to the failing service so it has room to recover.
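A minimal sketch of how a retry budget and jittered backoff fit together, assuming a simple counter-based budget with the 10% ratio used as the example above. The names `RetryBudget`, `backoff_delay` and `call_with_retries` are illustrative and not taken from any particular library; a production budget would normally track a sliding time window rather than lifetime counters.

```python
import random
import time


class RetryBudget:
    """Grants retries only while they stay under a fixed ratio of first attempts."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio   # assumed cap: retries <= 10% of first attempts
        self.requests = 0    # first attempts seen so far
        self.retries = 0     # retries granted so far

    def record_request(self):
        self.requests += 1

    def try_acquire(self):
        # Grant a retry only if it keeps total retries within the budget.
        if self.retries + 1 <= self.requests * self.ratio:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retries(fn, budget, max_attempts=3, sleep=time.sleep):
    """Call fn, retrying on failure only while attempts and budget allow."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not budget.try_acquire():
                raise  # fail fast instead of adding load to a struggling service
            sleep(backoff_delay(attempt))
```

With this shape, a fully failing dependency sees at most about 1.1x the normal request rate, where naive retries with `max_attempts=3` would send it 3x.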
Cache Stampedes
Symptom: a hot cache key expires under load. Hundreds of concurrent requests miss the cache and hit the origin. Origin overloads. Stays overloaded until the cache is repopulated.
Defence: per-key locking on cache misses (only one recompute at a time). Probabilistic early expiration. Stale-while-revalidate semantics. The Caching Strategies guide covers all three patterns.
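Per-key locking can be sketched in-process as follows. This is an illustration only: `SingleFlightCache` is a hypothetical name, TTL handling is reduced to an explicit `expire` call, and a multi-pod deployment would need a distributed lock (for example in Redis) rather than `threading.Lock`.

```python
import threading


class SingleFlightCache:
    """In-process cache where only one caller recomputes a missing key."""

    def __init__(self):
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the per-key lock table

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def expire(self, key):
        # Stand-in for TTL expiry: drop the cached value.
        self._values.pop(key, None)

    def get(self, key, compute):
        if key in self._values:        # fast path: cache hit
            return self._values[key]
        with self._lock_for(key):      # one thread per key gets past here at a time
            if key in self._values:    # re-check: another thread may have filled it
                return self._values[key]
            value = compute()          # the single expensive origin call
            self._values[key] = value
            return value
```

The re-check inside the lock is what turns hundreds of concurrent misses into one origin call: every waiter finds the value already present once it acquires the lock.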
Split Brain
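Symptom: a network partition isolates two halves of a cluster whose leader election does not require a majority quorum (home-grown failover is the usual culprit). Both halves elect leaders independently. Both accept writes. When the partition heals, you have divergent state.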
Defence: real consensus algorithms (Raft, Paxos) require a majority quorum, so only one side can elect a leader; the other side cannot make progress. The lesson: do not invent your own “HA” without consensus underneath.
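The quorum arithmetic behind that guarantee is small enough to write down. The helper names below are illustrative; the point is only that two disjoint sides of a partition can never both hold a majority.

```python
def majority(cluster_size):
    """Smallest number of votes that is more than half the cluster."""
    return cluster_size // 2 + 1


def can_elect_leader(reachable_nodes, cluster_size):
    """A partition side may elect a leader only if it can reach a majority."""
    return reachable_nodes >= majority(cluster_size)


# A 5-node cluster split 3/2: only the 3-node side has quorum.
for side in (3, 2):
    print(side, can_elect_leader(side, 5))
# prints:
# 3 True
# 2 False
```

An even split (for example 2/2 in a 4-node cluster) leaves neither side with a majority, which is why odd cluster sizes are the usual recommendation: the cluster stalls rather than diverges.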
Hot Partitions
Symptom: one shard / partition / Redis slot receives 10x the traffic of the others. That node saturates while others sit idle. p99 latency on the hot key climbs; the rest of the system looks fine.
Defence: detect via per-partition QPS metrics. Mitigate by salting keys, splitting the hot key into N keys, or fronting with a per-pod local cache so the hot key never reaches the distributed cache.
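Splitting a hot key into N salted keys might look like the sketch below for a read-heavy value. The fan-out of 10, the partition count, and the CRC-based `partition_of` function are assumptions for the demo; a real store uses its own key-to-partition hashing, and `store` here is just a dict standing in for the cache client.

```python
import random
import zlib

SALTS = 10        # assumed fan-out: suffixes 0..9
PARTITIONS = 16   # hypothetical partition count for the demo


def salted_keys(key, salts=SALTS):
    """All salted variants of a hot key, e.g. 'cart:42:0' .. 'cart:42:9'."""
    return [f"{key}:{i}" for i in range(salts)]


def partition_of(key, partitions=PARTITIONS):
    # Stand-in for the store's own key-to-partition hashing.
    return zlib.crc32(key.encode()) % partitions


def write_all(store, key, value):
    # Replicate the hot value under every salted key.
    for k in salted_keys(key):
        store[k] = value


def read_any(store, key, rng=random):
    # Each read picks one replica at random, spreading read load.
    return store[rng.choice(salted_keys(key))]
```

The trade-off is write amplification: every update touches all N salted keys, so this suits values that are read far more often than they are written.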
Queue Overload
Symptom: producers outpace consumers. Queue depth grows. Eventually the queue runs out of memory or disk; messages start failing or get dropped.
Defence: backpressure. Bounded queues. Producer throttling on consumer-lag signals. Auto-scale consumers on lag.
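As a small illustration of a bounded queue pushing back on producers, the sketch below uses Python's standard `queue.Queue` with a `maxsize`. The bound of 3 and the `try_enqueue` helper are arbitrary choices for the demo; a message broker would expose the same idea through its own limits and lag metrics.

```python
import queue


def try_enqueue(q, item):
    """Non-blocking enqueue: returns False instead of growing without bound."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False  # caller should slow down, retry later, or shed the message


work = queue.Queue(maxsize=3)   # bound chosen arbitrarily for the demo
accepted = [try_enqueue(work, n) for n in range(5)]
print(accepted)   # [True, True, True, False, False]
```

The `False` return is the backpressure signal: the producer learns immediately that consumers are behind, rather than finding out when the queue runs out of memory.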
DNS Outages
Symptom: DNS resolver is slow or unreachable. Every service-to-service call stalls on lookup. The cluster appears to be hanging without errors.
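Defence: NodeLocal DNSCache to keep DNS off the critical path. TTLs chosen deliberately (short enough to follow endpoint changes, long enough that cached answers outlast a brief resolver outage) combined with negative-caching tuning. Service-mesh-based discovery (sidecar handles endpoint changes via xDS, no DNS in the data path).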
Service Discovery Failures
Symptom: discovery system (Consul, etcd, kube-apiserver) is unhealthy. Services cannot find each other. Existing connections work; new connections fail.
Defence: clients cache the last-known-good endpoint set with a generous TTL. With that cache in place, the system rides out a degraded discovery backend for as long as the cached endpoints remain valid and existing connections stay up.
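A last-known-good cache can be sketched as a thin wrapper around whatever lookup call the client already makes. Everything here is illustrative: `EndpointCache` and its `lookup` callable are hypothetical, the 300-second TTL is an arbitrary "generous" value, and treating an empty endpoint list as a failed lookup is a design assumption.

```python
import time


class EndpointCache:
    """Serves the last successful lookup while discovery is unhealthy."""

    def __init__(self, lookup, ttl_seconds=300.0, clock=time.monotonic):
        self._lookup = lookup    # hypothetical callable: service name -> list of endpoints
        self._ttl = ttl_seconds  # how long a stale answer may still be served
        self._clock = clock
        self._good = {}          # name -> (endpoints, fetched_at)

    def resolve(self, name):
        try:
            endpoints = self._lookup(name)
            if endpoints:        # assumption: an empty answer counts as a failure
                self._good[name] = (endpoints, self._clock())
                return endpoints
        except Exception:
            pass                 # discovery is unhealthy; fall through to the cache
        cached = self._good.get(name)
        if cached and self._clock() - cached[1] <= self._ttl:
            return cached[0]     # serve last-known-good
        raise LookupError(f"no usable endpoints for {name}")
```

The TTL bounds the risk of the approach: the longer stale endpoints are served, the more likely some of them have since been replaced.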
Cascading Failures
The mother of all distributed-systems incidents. A small failure becomes a cluster-wide outage because every layer amplifies the load. The diagram above shows the basic shape: DB slows, Service B times out, Service A retries, the gateway queues up requests, users retry, the gateway saturates, more services fail.
Defences: circuit breakers per dependency. Retry budgets capping amplification. Load shedding (return 503 to a percentage of requests when saturated). Graceful degradation paths so a non-critical failure doesn't block critical paths. The Incident Response Simulator walks through real scenarios.
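To make the per-dependency circuit breaker concrete, here is a minimal single-threaded sketch. The class name, the threshold of 5 consecutive failures and the 30-second cool-down are assumptions for illustration; production libraries add thread safety, rolling error-rate windows and a limit on concurrent half-open probes.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("failing fast while the dependency recovers")
            # Cool-down elapsed: half-open, let this call through as a probe.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()     # open, or re-open after a failed probe
            raise
        self._failures = 0
        self._opened_at = None                      # success closes the circuit
        return result
```

One breaker instance per downstream dependency keeps a slow database from consuming the threads and timeouts that healthy dependencies need.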
The Post-Incident Review
The blameless post-mortem is the discipline that turns incidents into learning. The structure that works:
- Timeline: minute-by-minute account of what happened. No interpretation; just facts.
- Impact: who was affected, for how long, in what way.
- Root cause: what enabled the incident. Often multiple contributing factors.
- Detection: how was the incident discovered? Could it have been earlier?
- Mitigation: what stopped the incident? Was it the right action?
- Action items: each one assigned, deadlined, tracked. Without these the document is theatre.
Figure: Cache Stampede Visualised
Figure: Split Brain Architecture
Self-Check Quiz
- A retry storm is happening. You add exponential backoff. Symptoms partially improve. What did you miss? (Answer: backoff alone helps but does not cap total RPS to the failing service. Need a retry budget too.)
- Your cache stampedes every 5 minutes for 2 seconds. The TTL is 5 minutes. What is the simplest fix? (Answer: probabilistic early expiration — a small chance to recompute before TTL — distributes load over time without coordination.)
- You operate two regions in active-active. Both elect leaders during a partition. Why is this catastrophic for payments? (Answer: split brain. Both sides accept conflicting writes. Without consensus, reconciliation requires manual resolution. Active-active works for read-heavy data, not writes-with-consequence.)
- Post-mortem action items consistently slip. What changes the dynamic? (Answer: assigned owner, deadline, tracked alongside feature work, reviewed in next post-mortem. Without follow-up the document is theatre.)
The Runtime Security cheatsheet covers detection patterns for the failure scenarios above. Practice with the Incident Response Simulator.
Production Notes
- Build a per-failure-mode runbook library. Each runbook has detection signals, immediate-action checklist, recovery steps, and post-incident actions.
- Test runbooks in staging and chaos drills. Untested runbooks slow incident response down rather than speeding it up.
- Capture every incident as a learning artefact even if there was “no real impact”. Near-misses are the cheapest training data.
Common Mistakes
- Skipping post-incident reviews on small incidents. The next bigger incident usually has the same root cause.
- Action items without owners or deadlines. The post-mortem becomes theatre.
- Treating retries as a fix instead of a load multiplier. Retries are a tool; budgets are the discipline.
- Single-cause root-cause analysis. Real incidents have multiple contributing factors; the post-mortem should surface all of them.
Security Risks to Watch
- Cascading failures often expose security gaps too — rate-limit bypasses, circuit-breaker fail-open behaviour, fallback paths that skip authz.
- Incident response is when an attacker is most likely to slip in — on-call engineers are distracted; emergency commits skip review.
- Post-mortems should include a security review: did this incident reveal a hardening gap? Add the hardening as an action item.
Design Tradeoffs
Manual incident response
Pros
- Engineer judgment in the loop
- Catches novel failures
Cons
- Slow
- Error-prone under pressure
Automated runbooks (PagerDuty, Rundeck)
Pros
- Consistent execution
- Fast
Cons
- Only handles known patterns
- Bad automation can amplify incidents
Hybrid (auto-mitigate, human-confirm)
Pros
- Speed + judgment
- Best of both
Cons
- Tooling investment
Production Alternatives
- PagerDuty + automated runbooks: Industry standard incident management; ties detection to action.
- Rundeck / Ansible AWX: Self-hosted runbook automation; tighter integration with infra.
- AWS Incident Manager / GCP Cloud Operations: Cloud-native incident response; tighter integration with cloud telemetry.
- FireHydrant / incident.io: Modern incident-management platforms with built-in post-mortem workflows.
Think Like an Engineer
- After every incident, ask: what would have prevented this entirely? Often the fix is upstream of the immediate cause.
- Build a catalogue of failure modes. New incidents either match one (resolve fast) or are novel (capture and add to catalogue).
- When designing a system, walk through the failure-mode catalogue mentally. Which of these can hit my system? What is my defence?
Production Story
A retail platform's peak-Sunday traffic caused a Redis hot-partition incident on the cart-service. p99 spiked; users hit refresh; refresh multiplied load; circuit breakers in the gateway tripped; the gateway returned 503 for everything. The on-call engineer recognised the pattern within 90 seconds (cart latency in dashboard + Redis per-partition QPS imbalance), salted the cart-key (cart:user_id:0..9), and the system recovered. The runbook had been written 6 months earlier from a similar incident at a different company. The lesson: lessons travel; runbooks are the medium.
Key Terms
- Retry storm: Failure mode where retries amplify load on a struggling backend.
- Cache stampede: Many concurrent requests hit the origin when a hot cache key expires.
- Split brain: A network partition causes two halves of a system to operate independently with divergent state.
- Hot partition: A shard receiving disproportionately high traffic, overloading one node.
- Post-mortem: Blameless review of an incident that produces tracked action items.
Hands-On Labs
Lab 11.1 — Reproduce a Retry Storm
Configure naive retries; cause an outage; add backoff + budget; observe recovery.
90 minutes - Intermediate
- Set up 3-service chain with naive retries
- Inject 50% errors on bottom service; observe storm
- Add exponential backoff + budget; observe recovery
Lab 11.2 — Cache Stampede on Expiry
Cause a stampede when a hot key expires; add per-key locking; verify fix.
60 minutes - Intermediate
- Identify a hot key with TTL 30s
- Send 1000 concurrent requests at expiry; observe origin meltdown
- Add per-key Redis lock for recompute
- Repeat; verify single recompute
Lab 11.3 — Post-Incident Review
Write a complete post-mortem for one of the simulated incidents above.
60 minutes - Intermediate
- Pick the retry-storm or cache-stampede incident
- Reconstruct the timeline from logs/metrics
- Document root cause, detection, mitigation
- Define 3 concrete action items
Key Takeaways
- Retry storms are caused by naive retries without budgets; cap them
- Cache stampedes need per-key locking, probabilistic expiration, or stale-while-revalidate
- Split brain is prevented by real consensus — do not invent HA without quorum math
- Cascading failures need defences at every layer: circuit breakers, budgets, load shedding, degradation
- Post-incident reviews convert incidents into engineering wins — without action items they are theatre