Module 7 of 12

Reliability & Failure Engineering

Circuit breakers, bulkheads, graceful degradation, and chaos engineering - how reliability is engineered, not hoped for.

4 hours · 3 labs · Free


Start here

Learning objectives

Design retry policies that survive a downstream brownout
Implement circuit breakers and understand the half-open state
Apply bulkhead isolation to prevent noisy neighbours
Build graceful degradation paths that turn outages into reduced functionality
Run chaos experiments without breaking production

Before

No timeouts; one slow downstream stalls everything
Naive retries amplify load during brownouts
Single thread pool shared across all dependencies; one slow dependency exhausts it
Failure recovery never tested; first chaos experiment is the real outage

After

Per-call timeouts shorter than caller's deadline; deadline propagation across services
Retry budget caps amplification regardless of layer count
Bulkheads (separate pools per dependency); failure contained
Quarterly chaos drills; resilience patterns proven before they're needed

Reliability is not the absence of failure. It is the system continuing to do something useful when failure inevitably arrives. Every production engineer eventually learns that the question is not “will this fail?” but “how quickly will it recover, and will the failure be contained or amplified by the system around it?”

The Resilience Toolkit

Five patterns, each addressing a different failure class:

Timeouts: never wait forever. Every external call has a deadline; the deadline is tighter than your caller's deadline so you can retry within budget.
Retries with exponential backoff and jitter: transient failures should be retried, with delay between attempts and randomness to avoid synchronisation.
Circuit breakers: when a dependency is clearly down, stop calling it for a window so it can recover.
Bulkheads: isolate workload pools so a noisy or failing tenant cannot starve the others.
Graceful degradation: when a non-critical dependency fails, return reduced functionality rather than a full error.
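The first two patterns can be sketched together. This is a minimal illustration, not a library API: `call_with_retries`, its parameters, and the choice to treat `TimeoutError` and `ConnectionError` as transient are all assumptions made for the example. It retries with exponential backoff and full jitter, and stops as soon as the overall deadline is spent.

```python
import random
import time


def call_with_retries(fn, *, deadline_s, max_attempts=3, base_delay_s=0.1,
                      max_delay_s=2.0, clock=time.monotonic, sleep=time.sleep,
                      rng=random.random):
    """Call fn(remaining_s) with retries, never exceeding the overall deadline.

    fn receives the remaining time budget and is expected to use it as its
    own per-call timeout. All names here are hypothetical.
    """
    give_up_at = clock() + deadline_s
    last_error = None
    for attempt in range(max_attempts):
        remaining = give_up_at - clock()
        if remaining <= 0:
            break  # budget spent: stop rather than overrun the caller's deadline
        try:
            return fn(remaining)
        except (TimeoutError, ConnectionError) as exc:  # assumed transient
            last_error = exc
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Full jitter: sleep a random amount in [0, min(cap, base * 2^attempt)).
        backoff = min(max_delay_s, base_delay_s * (2 ** attempt))
        delay = min(rng() * backoff, max(0.0, give_up_at - clock()))
        sleep(delay)
    raise TimeoutError("retries exhausted or deadline reached") from last_error
```

The clock, sleep, and random source are injectable so the policy can be tested without real waiting. Only retry calls that are safe to repeat.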

Circuit Breakers in Detail

A circuit breaker is a state machine with three states - CLOSED, OPEN, HALF-OPEN. In the CLOSED state, requests flow normally. After N consecutive failures (or a failure rate above a threshold over a window), the breaker trips to OPEN, and subsequent requests fail fast without hitting the dependency. After a timeout (typically 5–30s), the breaker transitions to HALF-OPEN and allows a small number of trial requests; if they succeed, the breaker returns to CLOSED. If they fail, it goes back to OPEN.

The key behaviour: fail-fast in OPEN state. The breaker prevents the calling service from queuing up requests against a dead dependency, which would otherwise consume thread pool capacity and cascade the failure to the caller.
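The three-state machine described above can be sketched in a few lines. This is a hand-rolled illustration under simplifying assumptions: it trips on consecutive failures, allows a single trial call in HALF-OPEN, and is not thread-safe. `CircuitBreaker`, `CircuitOpenError`, and the parameter names are hypothetical, not taken from any library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_seconds=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_seconds:
                # Fail fast: do not tie up a thread on a dead dependency.
                raise CircuitOpenError("dependency marked down; failing fast")
            self.state = "HALF_OPEN"  # timeout elapsed: allow a trial request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failures += 1
        # A failed trial in HALF_OPEN reopens immediately.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

A production breaker would also need locking, a limit on concurrent trial requests, and metrics on state transitions.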

Production implementations: Hystrix (legacy, retired), Resilience4j (Java), Polly (.NET), Envoy circuit breaking, gRPC client interceptors, Istio destination rules with outlier detection. Service meshes do most of this for you.

Bulkheads

The bulkhead pattern (named for ship compartments) isolates resources so failure in one section cannot sink the ship. Two common forms:

Thread pool bulkheads: separate thread pools per dependency or per tenant. A slow dependency cannot exhaust the global thread pool.
Connection pool bulkheads: separate connection pools to different downstreams. The pool to a slow database does not starve calls to a fast one.
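A thread pool bulkhead might look like the sketch below. It assumes one `Bulkhead` per dependency, each with its own bounded pool, and rejects new work when every slot is busy rather than queueing it. The class and error names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BulkheadFullError(Exception):
    """Raised when a dependency's pool has no free slot."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent,
                                        thread_name_prefix=name)
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def submit(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot once the call finishes, whether it succeeded or not.
        future.add_done_callback(lambda _f: self._slots.release())
        return future
```

With one instance per downstream, a stalled call to one dependency fills only its own slots; calls through any other `Bulkhead` are unaffected.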

In Kubernetes, a stronger form: separate node pools or namespaces per tenant or per workload class. A misbehaving batch job in its own pool cannot impact the latency-sensitive tier.

Graceful Degradation

Real systems have many dependencies, only some of which are critical. Graceful degradation means: when a non-critical dependency fails, the system returns a useful but reduced response rather than an error.

Examples:

Recommendations service is down; product page returns without recommendations.
Sentiment analysis fails; review still posts, sentiment computed later.
Personalisation service is slow; anonymous experience served as fallback.
Cache is unreachable; fall through to database with a circuit breaker so DB is not overwhelmed.

The architectural insight: not all dependencies are equal. Identify the “critical path” (the dependencies whose failure must fail the request) and the “enriching path” (everything else). Both need timeouts, but they handle errors differently in code - critical paths propagate errors to the caller; enriching paths log the error and return a fallback.
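The split between critical and enriching calls can be sketched for the product-page example. The function names (`enrich`, `render_product_page`, `load_product`, `load_recommendations`) and the 200 ms timeout are assumptions for illustration only.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("product_page")
_pool = ThreadPoolExecutor(max_workers=4)  # used by enriching calls only


def enrich(fn, *args, timeout_s=0.2):
    """Run an enriching call; return None instead of raising on failure."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except Exception:  # includes the timeout raised by future.result()
        log.warning("enriching call %r failed or timed out", fn, exc_info=True)
        return None


def render_product_page(product_id, load_product, load_recommendations):
    product = load_product(product_id)  # critical path: any error propagates
    recs = enrich(load_recommendations, product_id)
    return {
        "product": product,
        "recommendations": recs if recs is not None else [],
        "degraded": recs is None,  # lets callers and dashboards see the fallback
    }
```

Note that a timed-out enriching call keeps running in its worker thread until it returns, so the underlying call still needs its own timeout.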

Chaos Engineering

Chaos engineering, popularised by Netflix, is the practice of intentionally injecting failures into production-like systems to validate that resilience patterns actually work. The principle: if you do not test failure handling, you do not know if it works.

Chaos experiments graduate in scope:

Local: kill a process; restart; verify it recovers.
Staging: kill a pod; verify the ReplicaSet replaces it and traffic shifts to the remaining replicas.
Production (off-peak): kill a node; verify cluster autoscaler + pod rescheduling work.
Production (peak): planned game day; multi-team participation; document outcomes.

Tools: Chaos Mesh (Kubernetes-native), Gremlin (commercial), LitmusChaos, Toxiproxy (network-level), Pumba (container-level). Start small, build a chaos culture incrementally.
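At the smallest scope, a fault can be injected in-process before reaching for any of the tools above. The wrapper below is a hypothetical sketch, not part of any chaos tool: it fails a configurable fraction of calls and has an explicit abort switch.

```python
import random


class FaultInjector:
    """Wraps a callable and fails a configurable fraction of calls."""

    def __init__(self, fn, error_rate=0.0, rng=random.random):
        self.fn = fn
        self.error_rate = error_rate
        self.rng = rng
        self.enabled = True  # abort switch: set to False to end the experiment

    def __call__(self, *args, **kwargs):
        if self.enabled and self.rng() < self.error_rate:
            raise ConnectionError("injected fault")
        return self.fn(*args, **kwargs)
```

Wrapping a downstream client with `error_rate=0.5` gives a flaky dependency to exercise retries and breakers against; setting `enabled = False` restores normal behaviour immediately.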

Cascading Failures

The most painful production outages are cascading: a small failure in one service causes load on dependencies, which cause load on their dependencies, which exhaust resources, which cause more failures. The cycle continues until the system stops.

Defences against cascading failure:

Per-dependency circuit breakers to break the chain.
Retry budgets to cap retry amplification (Module 2).
Rate limits on internal RPCs to enforce backpressure.
Graceful degradation paths that do not depend on the failing service.
Load shedding: when a service is overloaded, return 503 to a percentage of requests rather than degrading all of them.
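A retry budget from the list above can be sketched as a simple credit counter: every first-attempt request earns a fraction of a retry, and a retry is allowed only when a whole credit is available. The class name, the 10% default, and the banking cap are assumptions; the sketch uses integer hundredths to avoid floating-point drift and is not thread-safe.

```python
class RetryBudget:
    """Allow retries only up to a percentage of first-attempt requests."""

    def __init__(self, retry_percent=10, max_banked_retries=10):
        self.retry_percent = retry_percent
        self.cap = max_banked_retries * 100  # credit is in hundredths of a retry
        self.credit = 0

    def record_request(self):
        # Each first-attempt request earns retry_percent/100 of a retry.
        self.credit = min(self.cap, self.credit + self.retry_percent)

    def try_spend_retry(self):
        # Spend one whole retry if available; otherwise the caller must give up.
        if self.credit >= 100:
            self.credit -= 100
            return True
        return False
```

During a brownout, retries stop once the credit is gone, so the extra load on the failing service stays near the configured percentage instead of multiplying.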

Retry Amplification Visualised
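
As a rough model, assume every hop in a call chain makes the same number of attempts and every attempt fails at the deepest service. The worst-case load on that service then multiplies at each hop. The sketch below computes that figure and, for comparison, a simplified budgeted case in which each layer adds at most a fixed fraction of extra traffic. Both functions and the model itself are illustrative assumptions.

```python
def amplification(hops, attempts_per_hop):
    """Worst-case calls reaching the deepest service per original request."""
    return attempts_per_hop ** hops


def budgeted_amplification(hops, retry_ratio):
    """Same chain when each layer may add at most retry_ratio extra traffic."""
    return (1 + retry_ratio) ** hops


if __name__ == "__main__":
    for attempts in (1, 2, 3, 4):
        print(f"3 hops, {attempts} attempts per hop: {amplification(3, attempts)}x")
    print(f"3 hops, 10% retry budget: {budgeted_amplification(3, 0.1):.2f}x")
```

With three attempts at each of three hops the failing service sees up to 27 calls per original request; with a 10% budget per layer the same chain stays at roughly 1.33x.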

Self-Check Quiz

Your circuit breaker is set to trip after 5 consecutive failures. Under normal load, you see brief 503 spikes that should not trip it. What do you change? (Answer: switch from a consecutive-failure count to an error rate over a window, e.g. 50% errors in the last 30s. Consecutive-failure counts are noisy.)
You have separate thread pools per dependency (bulkheads). One pool exhausts. What is the rest of your service doing? (Answer: still serving traffic to other dependencies. That is the whole point of the bulkhead - isolate failure.)
Why is graceful degradation hard to retrofit? (Answer: it requires identifying critical-path vs enriching dependencies and writing fallback paths. Adding it after the first incident means rewriting code under pressure.)
You start a chaos experiment in production. Within 5 seconds you have a real outage. What did you skip? (Answer: practising in staging first; starting small, with one pod at off-peak; having an explicit abort procedure; involving on-call.)

For runtime-detection patterns that pair with these resilience controls, see the Runtime Security cheatsheet and the Falco glossary entry. The Incident Response Simulator exercises chaos and triage scenarios.

Real world

Where this shows up

Netflix's Hystrix (now retired) shaped the industry's circuit-breaker pattern; Resilience4j is the modern Java implementation.
AWS uses bulkhead isolation extensively in their internal services to contain noisy-neighbour problems.
Cloudflare's graceful-degradation patterns let them serve cached responses during origin outages.
Netflix's Chaos Monkey, which grew into the broader discipline of chaos engineering, is the canonical example of running fault injection in production deliberately.

Production notes

Keep these close

Use error-rate-over-window for circuit-breaker tripping, not consecutive-failure count. Consecutive counts trigger on noise.
Service meshes (Envoy/Istio/Linkerd) implement most resilience patterns at the data plane - do not rebuild them in application code if you have a mesh.
Identify critical-path vs enriching dependencies; treat them differently in code (critical = propagate errors, enriching = swallow with logging).
Run quarterly chaos game days. Untested resilience is hopeful resilience.
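
The first note, tripping on error rate over a window, can be sketched as a small tracker that a breaker consults. The class name, the 30 s window, the 50% threshold, and the `min_calls` floor are assumptions for illustration; the floor is there so a handful of calls cannot trip the breaker on their own.

```python
import time
from collections import deque


class ErrorRateWindow:
    """Tracks call outcomes over a sliding time window."""

    def __init__(self, window_s=30.0, threshold=0.5, min_calls=20,
                 clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.min_calls = min_calls
        self.clock = clock
        self.events = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, failed):
        now = self.clock()
        self.events.append((now, failed))
        self._evict(now)

    def _evict(self, now):
        # Drop outcomes that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()

    def should_trip(self):
        self._evict(self.clock())
        total = len(self.events)
        if total < self.min_calls:
            return False  # too little traffic to judge
        failures = sum(1 for _, failed in self.events if failed)
        return failures / total >= self.threshold
```

A short burst of five failures among a hundred calls leaves the rate at 5%, well under the threshold, whereas a sustained brownout pushes it over.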

Common mistakes

What usually breaks

Setting infinite retries on non-idempotent calls. One downstream blip becomes duplicate side effects everywhere.
Cascading retries without budgets. A 3-hop chain with 3 attempts per hop (1 call + 2 retries) = 3 x 3 x 3 = 27x amplification on the failing service.
No fallback for “enriching” calls (recommendations, sentiment, personalisation). Their failure should NOT fail the request.
Chaos in production without practice in staging. Your first production chaos experiment must not be your first chaos experiment.

Security risks

Threats to watch

Circuit breakers that fail-open during a partial outage may bypass authn / authz checks. Decide failure mode deliberately.
Chaos engineering in production without explicit access-control review can give attackers a roadmap of failure modes.
Graceful degradation paths often skip security checks (e.g. cache fallback returns data without re-checking permissions). Audit fallback paths.

Tradeoffs

Design choices you should be able to defend

Application-level resilience (Resilience4j, Polly)

Pros

Tight integration with code
Per-call control

Cons

Per-language implementation
Hard to enforce consistently across teams

Service-mesh resilience (Envoy/Istio)

Pros

Centralised, language-agnostic
Operator-controlled, no app changes

Cons

Sidecar latency tax
Operational complexity

Hybrid (mesh + app-level)

Pros

Right tool per scenario

Cons

Two layers to reason about

Alternatives

Other production approaches

Resilience4j (Java)

Modern circuit breaker / retry / bulkhead library; Hystrix successor.

Polly (.NET)

Equivalent for .NET; mature, well-documented.

Service mesh resilience (Envoy/Istio/Linkerd)

Centralised at the data plane; no app code changes.

Chaos Mesh

Kubernetes-native chaos platform; CRD-driven experiments.

Gremlin

Commercial chaos platform; broader cloud + non-K8s coverage.

LitmusChaos

Open-source chaos for Kubernetes; CNCF incubating.

Think like an engineer

Questions to answer before shipping

For every external call ask: what is the failure mode if this call hangs? If the answer is “our entire service hangs too”, you need a timeout AND a fallback.
Rank your dependencies by criticality once a quarter. Move enriching calls to non-blocking; tighten circuit breakers on critical ones.
Run a chaos drill before every major launch. The bug it finds is almost always something nobody predicted.

Key terms

Vocabulary used in this module

Circuit breaker

State machine that stops calling a failing dependency to prevent cascading failure.

Bulkhead

Resource isolation pattern; separate pools per dependency to contain failure.

Graceful degradation

Returning reduced functionality when a non-critical dependency fails.

Chaos engineering

Discipline of injecting failures into production-like systems to validate resilience.

Retry budget

Cap on retry traffic as a fraction of total RPS; prevents retry storms.

Labs

Hands-on labs

60 minutes · Intermediate

Lab 7.1 - Circuit Breaker in Action

Implement a circuit breaker (Resilience4j or hand-rolled), trigger failure modes, validate state transitions.

Wrap a flaky downstream call with a circuit breaker
Inject 50% error rate; observe breaker trip to OPEN
Wait for timeout; observe HALF-OPEN; success returns to CLOSED
Compare with no breaker: caller threads exhaust

View lab on GitHub

120 minutes · Advanced

Lab 7.2 - Chaos Mesh on Kubernetes

Run chaos experiments on a kind cluster and observe how the application reacts.

Install Chaos Mesh on a kind cluster
Inject pod-kill, network-loss, CPU stress experiments
Verify HPA, retry policies, circuit breakers all work
Document a chaos game-day runbook

View lab on GitHub

60 minutes · Intermediate

Lab 7.3 - Graceful Degradation Architecture

Refactor a service with multiple dependencies to degrade gracefully under partial failure.

Identify critical vs enriching dependencies
Add fallback responses for enriching dependencies
Inject failures and verify reduced-but-valid responses
Compare to baseline (full error on any dependency failure)

View lab on GitHub

Recap

Key takeaways

Reliability is engineered with timeouts, retries, circuit breakers, bulkheads, and graceful degradation - not hoped for
Service meshes (Envoy / Istio) implement most of these patterns without application changes; use them where you already run a mesh
Identify critical-path vs enriching dependencies and treat them differently in code
Chaos engineering is the only way to know your resilience patterns actually work
Cascading failures kill systems; break the chain with circuit breakers and retry budgets
