Module 11 of 12

Real-World Failure Scenarios

Retry storms, cache stampedes, split brain, hot partitions, queue overload, DNS outages, service-discovery failures, cascading failures - the incidents that actually happen, and how to engineer them away.

5 hours · 3 labs · Free

Start here

Learning objectives

Recognise the canonical distributed-systems failure modes by their telemetry signatures
Reproduce each failure in a controlled lab so the pattern is in your hands
Apply the architectural defences that make each failure hard or impossible
Write incident runbooks that an on-call engineer can actually use at 3am
Run a post-incident review that produces lasting improvements

Before

Outages debugged from cold; 90 minutes of cross-team Slack
Same outage class repeats every quarter; no learning loop
Post-mortems become theatre; action items slip indefinitely
On-call dreaded; new engineers take months to be effective

After

Per-failure-mode runbook library; triaged in minutes
Each incident strengthens the runbook; same class never repeats
Action items tracked alongside feature work; reviewed in next post-mortem
On-call is a known load; new engineers shadow then lead within weeks

Every distributed system fails the same way the others do. The taxonomy is small and the failure modes are well-documented, which means the patterns that defend against them are equally well-understood. The engineers who experience these incidents and build the defences are the ones the rest of the org calls when something is on fire.

This module walks the canonical incidents one at a time: what they look like, what causes them, how to defend.

Retry Storms

Symptom: a downstream service has a brownout. Errors trigger client retries. Retries multiply load on the failing service. The service cannot recover. p99 latency stays high; error rate stays high; you cannot get out of it without a deploy or restart.

Defence: every retry policy has a budget (cap retries at, say, 10% of total RPS). Exponential backoff with jitter. Circuit breakers stop calling the dead service so it can recover.
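A minimal sketch of the two client-side pieces, assuming a single process and a lifetime counter rather than the sliding window a production budget would use. The names `RetryBudget`, `backoff_delay` and `call_with_retries` are hypothetical, not taken from any library.

```python
import random
import time


class RetryBudget:
    """Allow retries only while they stay within `ratio` of requests seen."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        # Grant a retry only if it keeps retries <= ratio * requests.
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(fn, budget, max_attempts=3, sleep=time.sleep):
    """Call fn, retrying on failure only while the shared budget allows it."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up on the last attempt or when the budget is exhausted.
            if attempt == max_attempts - 1 or not budget.can_retry():
                raise
            sleep(backoff_delay(attempt))
```

With a 10% budget shared across all callers, a fully failing dependency sees roughly 1.1x the normal request rate instead of the 3x that three unconditional attempts would produce.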

Cache Stampedes

Symptom: a hot cache key expires under load. Hundreds of concurrent requests miss the cache and hit the origin. Origin overloads. Stays overloaded until the cache is repopulated.

Defence: per-key locking on cache misses (only one recompute at a time). Probabilistic early expiration. Stale-while-revalidate semantics. The Caching Strategies guide covers all three patterns.
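Two of these patterns can be sketched in-process. This assumes a single Python process with threads; across pods the per-key lock would need to live in a shared store. `SingleFlightCache` and `should_refresh_early` are illustrative names, and the early-expiry rule is one common formulation (recompute sooner when recomputing is expensive), not the only one.

```python
import math
import random
import threading
import time


class SingleFlightCache:
    """On a miss, only one caller per key recomputes; the rest wait for it."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._values = {}              # key -> (value, expires_at)
        self._locks = {}               # key -> threading.Lock
        self._guard = threading.Lock() # protects the _locks dict

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key, recompute):
        entry = self._values.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        with self._lock_for(key):
            # Re-check: another thread may have refilled while we waited.
            entry = self._values.get(key)
            if entry and entry[1] > time.monotonic():
                return entry[0]
            value = recompute()
            self._values[key] = (value, time.monotonic() + self.ttl)
            return value


def should_refresh_early(now, expires_at, recompute_cost, beta=1.0,
                         rng=random.random):
    """Probabilistic early expiration: refresh chance rises near expiry."""
    u = 1.0 - rng()  # in (0, 1], so log(u) is defined and <= 0
    return now - recompute_cost * beta * math.log(u) >= expires_at
```

The lock removes the thundering herd on a miss; the probabilistic check spreads recomputes out so that most misses never happen at the TTL boundary in the first place.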

Split Brain

Symptom: a network partition isolates two halves of a cluster that does not enforce a majority quorum. Both elect leaders independently. Both accept writes. When the partition heals, you have divergent state.

Defence: real consensus algorithms (Raft, Paxos) require a majority quorum, so only one side can elect a leader; the other side cannot make progress. The lesson: do not invent your own “HA” without consensus underneath.
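The quorum arithmetic behind that guarantee is small enough to check by hand. The sketch below is only the counting rule, not a consensus implementation; `quorum` and `can_elect_leader` are illustrative helper names.

```python
def quorum(cluster_size):
    """Smallest number of nodes that is a strict majority of the cluster."""
    return cluster_size // 2 + 1


def can_elect_leader(reachable_nodes, cluster_size):
    """A partition side can elect a leader only if it holds a majority."""
    return reachable_nodes >= quorum(cluster_size)


# A 5-node cluster split 3 / 2: only the 3-node side can make progress.
side_a = can_elect_leader(3, 5)
side_b = can_elect_leader(2, 5)

# A 4-node cluster split 2 / 2: neither side has a majority.
even_split = can_elect_leader(2, 4)
```

Two disjoint sides can never both hold a strict majority, which is why at most one leader exists. The even split also shows why odd cluster sizes are the usual choice: a 4-node cluster tolerates no more failures than a 3-node one.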

Hot Partitions

Symptom: one shard / partition / Redis slot receives 10x the traffic of the others. That node saturates while others sit idle. p99 latency on the hot key climbs; the rest of the system looks fine.

Defence: detect via per-partition QPS metrics. Mitigate by salting keys, splitting the hot key into N keys, or fronting with a per-pod local cache so the hot key never reaches the distributed cache.
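Key salting can be sketched for the simplest case, a hot counter. The assumption here is that the value can be split and re-aggregated (true for counters, not for arbitrary values); a plain dict stands in for the distributed store, and `incr` and `read_total` are hypothetical helpers.

```python
import random


def incr(store, key, n_salts, rng=random):
    """Write to one of n_salts sub-keys so no single key takes all writes."""
    salted = f"{key}#{rng.randrange(n_salts)}"
    store[salted] = store.get(salted, 0) + 1


def read_total(store, key, n_salts):
    """Reads fan out over every sub-key and sum the partial counts."""
    return sum(store.get(f"{key}#{i}", 0) for i in range(n_salts))
```

The trade is explicit: writes spread across N partitions, while each read costs N lookups. Salting suits write-hot keys; for read-hot keys the per-pod local cache is usually the cheaper fix.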

Queue Overload

Symptom: producers outpace consumers. Queue depth grows. Eventually the queue runs out of memory or disk; messages start failing or get dropped.

Defence: backpressure. Bounded queues. Producer throttling on consumer-lag signals. Auto-scale consumers on lag.
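A bounded queue with explicit backpressure, sketched with the standard library's `queue.Queue`. This assumes an in-process queue; a broker would expose the same idea through its own limits. `try_enqueue` is an illustrative helper name.

```python
import queue


def try_enqueue(q, item, timeout=0.0):
    """Return False instead of growing without bound when the queue is full.

    The caller decides what backpressure means: slow down, shed, or
    return an error upstream.
    """
    try:
        if timeout > 0:
            q.put(item, timeout=timeout)  # wait briefly for a free slot
        else:
            q.put_nowait(item)            # fail immediately when full
        return True
    except queue.Full:
        return False


jobs = queue.Queue(maxsize=2)
accepted = [try_enqueue(jobs, n) for n in range(3)]  # third put is rejected
jobs.get()                                           # a consumer frees a slot
accepted_after_drain = try_enqueue(jobs, 99)
```

The important property is that the producer learns about overload at enqueue time, while there is still something it can do about it, rather than when the queue runs out of memory.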

DNS Outages

Symptom: DNS resolver is slow or unreachable. Every service-to-service call stalls on lookup. The cluster appears to be hanging without errors.

Defence: NodeLocal DNSCache to keep DNS off the critical path. TTLs and negative caching tuned deliberately: short TTLs speed up failover but make every call more dependent on a healthy resolver. Service-mesh-based discovery (sidecar handles endpoint changes via xDS, no DNS in the data path).

Service Discovery Failures

Symptom: discovery system (Consul, etcd, kube-apiserver) is unhealthy. Services cannot find each other. Existing connections work; new connections fail.

Defence: clients cache the last-known-good endpoint set with a generous TTL. The system tolerates a degraded discovery system if existing connections can survive the window.
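A last-known-good cache might look like the sketch below. It assumes the discovery lookup is a callable that raises when the discovery system is unhealthy; `EndpointCache`, `fetch` and `max_stale_seconds` are hypothetical names, and the clock is injectable so the behaviour can be tested without waiting.

```python
import time


class EndpointCache:
    """Serve the last successful endpoint set while discovery is failing."""

    def __init__(self, fetch, max_stale_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._last = None       # last-known-good endpoint list
        self._last_at = None    # when it was fetched

    def endpoints(self):
        try:
            fresh = self._fetch()
        except Exception:
            # Discovery is down: fall back only within the staleness window.
            if (self._last is not None
                    and self._clock() - self._last_at <= self._max_stale):
                return self._last
            raise
        self._last = fresh
        self._last_at = self._clock()
        return fresh
```

The staleness window is the design decision: too short and a brief discovery blip becomes an outage, too long and clients keep calling endpoints that no longer exist.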

Cascading Failures

The mother of all distributed-systems incidents. A small failure becomes a cluster-wide outage because every layer amplifies the load. The basic shape: DB slows, Service B times out, Service A retries, the gateway queues up requests, users retry, the gateway saturates, more services fail.

Defences: circuit breakers per dependency. Retry budgets capping amplification. Load shedding (return 503 to a percentage of requests when saturated). Graceful degradation paths so a non-critical failure doesn't block critical paths. The Incident Response Simulator walks through real scenarios.
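Two of these defences, a per-dependency circuit breaker and probabilistic load shedding, can be sketched as follows. This is a single-threaded illustration under stated assumptions: consecutive-failure counting, one trial call after the reset window, and a shed fraction proportional to overload. `CircuitBreaker`, `CircuitOpenError` and `should_shed` are illustrative names, not a library API.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency whose circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_seconds:
                raise CircuitOpenError("dependency circuit is open")
            # Reset window elapsed: half-open, let this one call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def should_shed(in_flight, capacity, rng=random.random):
    """Shed a growing fraction of requests once load exceeds capacity."""
    if in_flight <= capacity:
        return False
    overload_fraction = (in_flight - capacity) / in_flight
    return rng() < overload_fraction
```

A handler would check `should_shed` first and return 503 when it is true, then wrap each downstream call in that dependency's own breaker so one slow dependency cannot consume every worker.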

The Post-Incident Review

The blameless post-mortem is the discipline that turns incidents into learning. The structure that works:

Timeline: minute-by-minute account of what happened. No interpretation; just facts.
Impact: who was affected, for how long, in what way.
Root cause: what enabled the incident. Often multiple contributing factors.
Detection: how was the incident discovered? Could it have been earlier?
Mitigation: what stopped the incident? Was it the right action?
Action items: each one assigned, deadlined, tracked. Without these the document is theatre.

[Figure: Cache Stampede Visualised]

[Figure: Split Brain Architecture]

Self-Check Quiz

A retry storm is happening. You add exponential backoff. Symptoms partially improve. What did you miss? (Answer: backoff alone helps but does not cap total RPS to the failing service. Need a retry budget too.)
Your cache stampedes every 5 minutes for 2 seconds. The TTL is 5 minutes. What is the simplest fix? (Answer: probabilistic early expiration - a small chance to recompute before TTL - distributes load over time without coordination.)
You operate two regions in active-active. Both elect leaders during a partition. Why is this catastrophic for payments? (Answer: split brain. Both sides accept conflicting writes. Without consensus, reconciliation requires manual resolution. Active-active works for read-heavy data, not writes-with-consequence.)
Post-mortem action items consistently slip. What changes the dynamic? (Answer: assigned owner, deadline, tracked alongside feature work, reviewed in next post-mortem. Without follow-up the document is theatre.)

The Runtime Security cheatsheet covers detection patterns for the failure scenarios above. Practice with the Incident Response Simulator.

Production notes

Keep these close

Build a per-failure-mode runbook library. Each runbook has detection signals, immediate-action checklist, recovery steps, and post-incident actions.
Test runbooks in staging and chaos drills. An untested runbook slows incident response down instead of speeding it up.
Capture every incident as a learning artefact even if there was “no real impact”. Near-misses are the cheapest training data.

Common mistakes

What usually breaks

Skipping post-incident reviews on small incidents. The next bigger incident usually has the same root cause.
Action items without owners or deadlines. The post-mortem becomes theatre.
Treating retries as a fix instead of a load multiplier. Retries are a tool; budgets are the discipline.
Single-cause root-cause analysis. Real incidents have multiple contributing factors; the post-mortem should surface all of them.

Security risks

Threats to watch

Cascading failures often expose security gaps too - rate-limit bypasses, circuit-breaker fail-open behaviour, fallback paths that skip authz.
Incident response is when an attacker is most likely to slip in - on-call engineers are distracted; emergency commits skip review.
Post-mortems should include a security review: did this incident reveal a hardening gap? Add the hardening as an action item.

Tradeoffs

Design choices you should be able to defend

Manual incident response

Pros

Engineer judgment in the loop
Catches novel failures

Cons

Slow
Error-prone under pressure

Automated runbooks (PagerDuty, Rundeck)

Pros

Consistent execution
Fast

Cons

Only handles known patterns
Bad automation can amplify incidents

Hybrid (auto-mitigate, human-confirm)

Pros

Speed + judgment
Best of both

Cons

Tooling investment

Alternatives

Other production approaches

PagerDuty + automated runbooks

Industry standard incident management; ties detection to action.

Rundeck / Ansible AWX

Self-hosted runbook automation; tighter integration with infra.

AWS Incident Manager / GCP Cloud Operations

Cloud-native incident response; tighter integration with cloud telemetry.

FireHydrant / incident.io

Modern incident-management platforms with built-in post-mortem workflows.

Think like an engineer

Questions to answer before shipping

After every incident, ask: what would have prevented this entirely? Often the fix is upstream of the immediate cause.
Build a catalogue of failure modes. New incidents either match one (resolve fast) or are novel (capture and add to catalogue).
When designing a system, walk through the failure-mode catalogue mentally. Which of these can hit my system? What is my defence?

Key terms

Vocabulary used in this module

Retry storm

Failure mode where retries amplify load on a struggling backend.

Cache stampede

Many concurrent requests hit the origin when a hot cache key expires.

Split brain

A network partition causes two halves of a system to operate independently with divergent state.

Hot partition

A shard receiving disproportionately high traffic, overloading one node.

Post-mortem

Blameless review of an incident that produces tracked action items.

Labs

Hands-on labs

90 minutes · Intermediate

Lab 11.1 - Reproduce a Retry Storm

Configure naive retries; cause an outage; add backoff + budget; observe recovery.

Set up 3-service chain with naive retries
Inject 50% errors on bottom service; observe storm
Add exponential backoff + budget; observe recovery

60 minutes · Intermediate

Lab 11.2 - Cache Stampede on Expiry

Cause a stampede when a hot key expires; add per-key locking; verify fix.

Identify a hot key with TTL 30s
Send 1000 concurrent requests at expiry; observe origin meltdown
Add per-key Redis lock for recompute
Repeat; verify single recompute

60 minutes · Intermediate

Lab 11.3 - Post-Incident Review

Write a complete post-mortem for one of the simulated incidents above.

Pick the retry-storm or cache-stampede incident
Reconstruct the timeline from logs/metrics
Document root cause, detection, mitigation
Define 3 concrete action items

Recap

Key takeaways

Retry storms are caused by naive retries without budgets; cap them
Cache stampedes need per-key locking, probabilistic expiration, or stale-while-revalidate
Split brain is prevented by real consensus - do not invent HA without quorum math
Cascading failures need defences at every layer: circuit breakers, budgets, load shedding, degradation
Post-incident reviews convert incidents into engineering wins - without action items they are theatre
