Before
- Outages debugged from cold; 90 minutes of cross-team Slack
- Same outage class repeats every quarter; no learning loop
- Post-mortems become theatre; action items slip indefinitely
- On-call dreaded; new engineers take months to be effective
Module 11 of 12
Retry storms, cache stampedes, split brain, hot partitions, queue overload, DNS outages, service-discovery failures, cascading failures — the incidents that actually happen, and how to engineer them away.
Every distributed system fails the same way the others do. The taxonomy is small and the failure modes are well-documented, which means the patterns that defend against them are equally well-understood. The engineers who experience these incidents and build the defences are the ones the rest of the org calls when something is on fire.
This module walks the canonical incidents one at a time: what they look like, what causes them, how to defend.
Retry storms
Symptom: a downstream service has a brownout. Errors trigger client retries, and the retries multiply load on the already-failing service, so it cannot recover. p99 latency and error rate both stay high, and the system does not come back without a deploy or restart.
Defence: every retry policy has a budget (cap retries at, say, 10% of total RPS). Exponential backoff with jitter. Circuit breakers stop calling the dead service so it can recover.
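The first two defences can be sketched together. This is a minimal illustration, not a production client: `RetryBudget`, `backoff_delay`, and `call_with_retries` are hypothetical names, the 10% ratio comes from the example figure above, and the budget counts over the whole process lifetime where a real one would likely use a sliding time window.

```python
import random
import time


class RetryBudget:
    """Permits retries only while they stay under a fixed fraction of requests."""

    def __init__(self, ratio=0.10):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        # Allow the retry only if it keeps retries <= ratio * requests.
        if self.retries + 1 <= self.ratio * self.requests:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(fn, budget, max_attempts=3, sleep=time.sleep):
    """Call fn, retrying on failure only while the shared budget allows it."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # out of attempts or out of budget: fail fast
            sleep(backoff_delay(attempt))
```

Sharing one `RetryBudget` across all calls to a dependency is what bounds amplification: during a brownout most failed requests fail once and stop, instead of each one tripling the load.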
Cache stampedes
Symptom: a hot cache key expires under load. Hundreds of concurrent requests miss the cache and hit the origin. The origin overloads and stays overloaded until the cache is repopulated.
Defence: per-key locking on cache misses (only one recompute at a time). Probabilistic early expiration. Stale-while-revalidate semantics. The Caching Strategies guide covers all three patterns.
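Per-key locking can be sketched in a few lines for a single process. `SingleFlightCache` is a hypothetical name, the in-memory dict stands in for a real cache, and TTL handling is left out to keep the locking visible; a distributed cache would need a distributed lock or request coalescing instead.

```python
import threading


class SingleFlightCache:
    """In-process cache where only one caller recomputes a missing key."""

    def __init__(self, loader):
        self._loader = loader      # function that fetches a value from the origin
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key):
        if key in self._values:
            return self._values[key]
        with self._lock_for(key):
            # Re-check: another thread may have filled the key while we waited.
            if key in self._values:
                return self._values[key]
            value = self._loader(key)
            self._values[key] = value
            return value
```

With this shape, concurrent misses on the same key queue behind one origin call and then read the freshly stored value, while misses on different keys still proceed in parallel.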
Split brain
Symptom: a network partition splits a cluster whose leader election is not backed by a majority quorum. Each side elects its own leader and accepts writes. When the partition heals, the two sides hold divergent state.
Defence: real consensus algorithms (Raft, Paxos) require a majority quorum, so at most one side can elect a leader; any side without a majority cannot make progress. The lesson: do not invent your own “HA” without consensus underneath.
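The majority rule is simple arithmetic, shown here as a small sketch. The function names are illustrative and not taken from any consensus library.

```python
def quorum_size(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def sides_able_to_elect(cluster_size, partition_sizes):
    """For each side of a partition, report whether it holds a majority."""
    need = quorum_size(cluster_size)
    return [size >= need for size in partition_sizes]


print(sides_able_to_elect(5, [3, 2]))  # [True, False]
print(sides_able_to_elect(4, [2, 2]))  # [False, False]
```

Two disjoint sides can never both hold a strict majority, which is why split brain is impossible under this rule. The second case also shows why even-sized clusters are less useful: an even split leaves no side able to make progress.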
Hot partitions
Symptom: one shard / partition / Redis slot receives 10x the traffic of the others. That node saturates while others sit idle. p99 latency on the hot key climbs; the rest of the system looks fine.
Defence: detect via per-partition QPS metrics. Mitigate by salting keys, splitting the hot key into N keys, or fronting with a per-pod local cache so the hot key never reaches the distributed cache.
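Splitting a hot key into N keys might look like the following for a write-heavy counter. This is a sketch under assumptions: `SaltedCounter` is a hypothetical name, the in-memory `Counter` stands in for a distributed key-value store, and the `key#i` naming scheme is only one possible convention.

```python
import random
from collections import Counter


class SaltedCounter:
    """Spreads increments for one logical key across `fanout` sub-keys."""

    def __init__(self, fanout=8, rng=None):
        self.fanout = fanout
        self._rng = rng or random.Random()
        self.store = Counter()  # stand-in for a distributed key-value store

    def _sub_keys(self, key):
        return [f"{key}#{i}" for i in range(self.fanout)]

    def incr(self, key):
        # Each write lands on a random sub-key, so no single partition is hot.
        sub_key = f"{key}#{self._rng.randrange(self.fanout)}"
        self.store[sub_key] += 1

    def read(self, key):
        # Reads pay for the spread: they must sum every sub-key.
        return sum(self.store[k] for k in self._sub_keys(key))
```

The tradeoff is explicit in `read`: writes get cheaper per partition, while each read fans out to N keys. For read-heavy hot keys the roles flip, with writes updating every copy and reads picking one at random.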
Queue overload
Symptom: producers outpace consumers. Queue depth grows. Eventually the queue runs out of memory or disk; messages start failing or get dropped.
Defence: backpressure. Bounded queues. Producer throttling on consumer-lag signals. Auto-scale consumers on lag.
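A bounded queue is the simplest form of backpressure. The sketch below uses Python's standard `queue.Queue`; the helper name `try_enqueue` is illustrative, and what the producer does on rejection (slow down, shed, return an error upstream) is left to the caller.

```python
import queue


def try_enqueue(q, item, timeout=0.0):
    """Enqueue if there is room; return False so the producer can back off."""
    try:
        if timeout > 0:
            q.put(item, timeout=timeout)  # wait briefly for a consumer to catch up
        else:
            q.put_nowait(item)
        return True
    except queue.Full:
        return False


work = queue.Queue(maxsize=2)  # the bound is the whole point
accepted = [try_enqueue(work, n) for n in range(3)]
print(accepted)  # [True, True, False]
```

Rejecting at the producer keeps the failure visible and cheap. An unbounded queue hides the same overload until memory or disk runs out.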
DNS outages
Symptom: the DNS resolver is slow or unreachable. Every service-to-service call stalls on lookup. The cluster appears to be hanging without errors.
Defence: NodeLocal DNSCache to keep DNS off the critical path. Short TTLs combined with negative-caching tuning. Service-mesh-based discovery (sidecar handles endpoint changes via xDS, no DNS in the data path).
Service-discovery failures
Symptom: the discovery system (Consul, etcd, kube-apiserver) is unhealthy. Services cannot find each other. Existing connections work; new connections fail.
Defence: clients cache the last-known-good endpoint set with a generous TTL. The system tolerates a degraded discovery system if existing connections can survive the window.
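A last-known-good cache can be sketched as a thin wrapper around whatever lookup the client already performs. `EndpointCache`, the `resolve` callable, and the 300-second staleness limit are all assumptions made for illustration, not the API of any particular discovery client.

```python
import time


class EndpointCache:
    """Serves the last successfully resolved endpoints while discovery is down."""

    def __init__(self, resolve, max_stale_seconds=300.0, clock=time.monotonic):
        self._resolve = resolve          # hypothetical lookup: service name -> list of endpoints
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache = {}                 # service -> (endpoints, fetched_at)

    def endpoints(self, service):
        try:
            fresh = list(self._resolve(service))
        except Exception:
            cached = self._cache.get(service)
            if cached is None:
                raise  # nothing to fall back on
            known_good, fetched_at = cached
            if self._clock() - fetched_at > self._max_stale:
                raise  # too old to trust; surface the discovery failure
            return list(known_good)
        self._cache[service] = (fresh, self._clock())
        return list(fresh)
```

The staleness limit is the tradeoff knob: a generous value rides out longer discovery outages, at the cost of routing to endpoints that may have been replaced in the meantime.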
Cascading failures
The mother of all distributed-systems incidents. A small failure becomes a cluster-wide outage because every layer amplifies the load. The diagram above shows the basic shape: DB slows, Service B times out, Service A retries, the gateway queues up requests, users retry, the gateway saturates, more services fail.
Defences: circuit breakers per dependency. Retry budgets capping amplification. Load shedding (return 503 to a percentage of requests when saturated). Graceful degradation paths so a non-critical failure doesn't block critical paths. The Incident Response Simulator walks through real scenarios.
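A per-dependency circuit breaker can be sketched as a small state machine. This is a simplified, single-threaded illustration: the class and exception names are hypothetical, the thresholds are arbitrary, and a production breaker would add locking and limit how many trial calls pass while half-open.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is marked unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy")
            # Timeout elapsed: half-open, let this call through as a trial.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial re-opens immediately; otherwise open at the threshold.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Using one breaker instance per dependency keeps a slow database from consuming threads and retries meant for healthy services, which is the amplification step in the cascade described above.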
The blameless post-mortem is the discipline that turns incidents into learning. The structure that works:
The Runtime Security cheatsheet covers detection patterns for the failure scenarios above. Practice with the Incident Response Simulator.
Production notes
Common mistakes
Security risks
Tradeoffs
Alternatives
Industry standard incident management; ties detection to action.
Self-hosted runbook automation; tighter integration with infra.
Cloud-native incident response; tighter integration with cloud telemetry.
Modern incident-management platforms with built-in post-mortem workflows.
Think like an engineer
Key terms
Retry storm: failure mode where retries amplify load on a struggling backend.
Cache stampede: many concurrent requests hit the origin when a hot cache key expires.
Split brain: a network partition causes two halves of a system to operate independently with divergent state.
Hot partition: a shard receiving disproportionately high traffic, overloading one node.
Post-mortem: blameless review of an incident that produces tracked action items.
Labs
Configure naive retries; cause an outage; add backoff + budget; observe recovery.
Cause a stampede when a hot key expires; add per-key locking; verify fix.
Write a complete post-mortem for one of the simulated incidents above.
Recap
Related resources