
Module 7: Reliability & Failure Engineering

Circuit breakers, bulkheads, graceful degradation, and chaos engineering — how reliability is engineered, not hoped for.

4 hours. 3 hands-on labs. Free course module.

Learning Objectives

  • Design retry policies that survive a downstream brownout
  • Implement circuit breakers and understand the half-open state
  • Apply bulkhead isolation to prevent noisy neighbours
  • Build graceful degradation paths that turn outages into reduced functionality
  • Run chaos experiments without breaking production

Why This Matters

Reliability is what separates engineers who get woken up at 3am from engineers whose systems quietly do the right thing during partial failure. The patterns in this module (circuit breakers, bulkheads, graceful degradation, chaos engineering) turn outages into reduced functionality. The teams that adopt them ship faster (because they can deploy with confidence) and sleep better (because partial failure is contained, not amplified).

Circuit breaker state machine (diagram):

  • CLOSED: requests pass. CLOSED → OPEN on N consecutive failures.
  • OPEN: requests fail fast. OPEN → HALF-OPEN after the timeout elapses.
  • HALF-OPEN: trial requests. HALF-OPEN → CLOSED if the trial succeeds, or back to OPEN if it fails.

Architecture diagram for Module 7: Reliability & Failure Engineering.

Lesson Content

Reliability is not the absence of failure. It is the system continuing to do something useful when failure inevitably arrives. Every production engineer eventually learns that the question is not will this fail but how quickly will it recover, and will the failure be contained or amplified by the system around it.

The Resilience Toolkit

Five patterns, each addressing a different failure class:

  • Timeouts: never wait forever. Every external call has a deadline; the deadline is tighter than your caller's deadline so you can retry within budget.
  • Retries with exponential backoff and jitter: transient failures should be retried, with delay between attempts and randomness to avoid synchronisation.
  • Circuit breakers: when a dependency is clearly down, stop calling it for a window so it can recover.
  • Bulkheads: isolate workload pools so a noisy or failing tenant cannot starve the others.
  • Graceful degradation: when a non-critical dependency fails, return reduced functionality rather than a full error.
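
The first two patterns, timeouts and retries with backoff and jitter, can be sketched together. This is a minimal illustration and not a production library: the names `backoff_delays` and `call_with_retries`, the default values, and the choice to treat only `ConnectionError` as transient are all assumptions made for the example. It uses "full jitter" (a uniform random delay up to the exponential cap) and stops retrying once the next delay would exceed the overall deadline.

```python
import random
import time


def backoff_delays(base=0.1, cap=2.0, attempts=4, rng=random.random):
    """Yield one full-jitter delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for n in range(attempts - 1):           # no delay is needed after the final attempt
        yield rng() * min(cap, base * (2 ** n))


def call_with_retries(fn, deadline_s=1.0, attempts=4, sleep=time.sleep,
                      clock=time.monotonic, rng=random.random):
    """Call fn() until it succeeds, attempts run out, or the deadline would be exceeded."""
    start = clock()
    delays = backoff_delays(attempts=attempts, rng=rng)
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except ConnectionError as err:       # assumption: only this error is transient
            last_error = err
        delay = next(delays, None)
        if delay is None or clock() - start + delay > deadline_s:
            break                            # out of attempts or out of time budget
        sleep(delay)
    raise last_error
```

Keeping `deadline_s` below the caller's own deadline is what leaves room to retry within budget. Only idempotent calls should be wrapped this way.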

Circuit Breakers in Detail

A circuit breaker is a state machine with three states — CLOSED, OPEN, HALF-OPEN. In CLOSED state, requests flow normally. After N consecutive failures (or a failure rate above threshold over a window), the breaker trips to OPEN, and subsequent requests fail-fast without hitting the dependency. After a timeout (typically 5–30s), the breaker transitions to HALF-OPEN and allows a small number of trial requests; if they succeed, the breaker returns to CLOSED. If they fail, it goes back to OPEN.

The key behaviour: fail-fast in OPEN state. The breaker prevents the calling service from queuing up requests against a dead dependency, which would otherwise consume thread pool capacity and cascade the failure to the caller.
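
The state machine described above can be sketched in a few lines. This is a hand-rolled illustration under stated assumptions, not the API of any library named in this module: the class name `CircuitBreaker`, the `CircuitOpenError` exception, and the defaults are hypothetical. It trips on consecutive failures, allows a single trial call in HALF-OPEN, and is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_timeout_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout_s = open_timeout_s
        self.clock = clock                   # injectable so tests can control time
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_timeout_s:
                raise CircuitOpenError("failing fast")   # dependency is not touched
            self.state = "HALF_OPEN"         # timeout elapsed: allow a trial request
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failures += 1
        # A failed trial reopens immediately; otherwise trip at the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

Callers wrap each dependency call as `breaker.call(lambda: client.get(...))` and treat `CircuitOpenError` as a signal to use a fallback rather than to retry.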

Production implementations: Hystrix (legacy, in maintenance mode), Resilience4j (Java), Polly (.NET), Envoy circuit breaking, gRPC client interceptors, Istio destination rules with outlier detection. If you run a service mesh, it can apply circuit breaking at the proxy layer without application changes.

Bulkheads

The bulkhead pattern (named for ship compartments) isolates resources so failure in one section cannot sink the ship. Two common forms:

  • Thread pool bulkheads: separate thread pools per dependency or per tenant. A slow upstream cannot exhaust the global thread pool.
  • Connection pool bulkheads: separate connection pools to different downstreams. The pool to a slow database does not starve calls to a fast one.

In Kubernetes, a stronger form: separate node pools or namespaces per tenant or per workload class. A misbehaving batch job in its own pool cannot impact the latency-sensitive tier.
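
At the application level, the same isolation can be sketched with a concurrency cap per dependency. Note the swap: this example uses a semaphore rather than a dedicated thread pool, which is a lighter-weight variant of the same idea. The names `Bulkhead` and `BulkheadFullError` are hypothetical, and rejecting immediately instead of queueing is a design choice made for the sketch.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Caps concurrent calls to one dependency; rejects instead of queueing."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        # Non-blocking acquire: a full compartment fails fast rather than
        # tying up the caller's thread while it waits for a slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return fn()
        finally:
            self._slots.release()
```

With one `Bulkhead` per downstream, a slow database can occupy at most its own `max_concurrent` callers, and calls to other dependencies keep their slots.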

Graceful Degradation

Real systems have many dependencies, only some of which are critical. Graceful degradation means: when a non-critical dependency fails, the system returns a useful but reduced response rather than an error.

Examples:

  • Recommendations service is down; product page returns without recommendations.
  • Sentiment analysis fails; review still posts, sentiment computed later.
  • Personalisation service is slow; anonymous experience served as fallback.
  • Cache is unreachable; fall through to database with a circuit breaker so DB is not overwhelmed.

The architectural insight: not all dependencies are equal. Identify the “critical path” (the dependencies whose failure must fail the request) and the “enriching path” (everything else). Treat them differently in your code — critical paths use timeouts and proper error propagation; enriching paths swallow errors with logging.
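
A minimal sketch of that split, using the product page and recommendations example from the list above. The function names (`enrich`, `render_product_page`) and the empty-list fallback are assumptions for illustration; the point is only that the critical call propagates errors while the enriching call is logged and replaced with a fallback.

```python
import logging

log = logging.getLogger("product_page")


def enrich(fetch, fallback, name):
    """Run an enriching call; on any failure, log it and return the fallback."""
    try:
        return fetch()
    except Exception:
        log.warning("enriching dependency %s failed; degrading", name, exc_info=True)
        return fallback


def render_product_page(fetch_product, fetch_recommendations):
    # Critical path: if the product cannot be loaded, the request must fail,
    # so any exception here propagates to the caller.
    product = fetch_product()
    # Enriching path: the page is still useful without recommendations.
    recs = enrich(fetch_recommendations, fallback=[], name="recommendations")
    return {"product": product, "recommendations": recs}
```

In a real service the enriching call would also carry its own timeout, so that a slow dependency degrades the response instead of delaying it.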

Chaos Engineering

Chaos engineering, popularised by Netflix, is the practice of intentionally injecting failures into production-like systems to validate that resilience patterns actually work. The principle: if you do not test failure handling, you do not know if it works.

Chaos experiments graduate in scope:

  1. Local: kill a process; restart; verify it recovers.
  2. Staging: kill a pod; verify the Deployment replaces it and traffic shifts to healthy pods while it restarts.
  3. Production (off-peak): kill a node; verify cluster autoscaler + pod rescheduling work.
  4. Production (peak): planned game day; multi-team participation; document outcomes.

Tools: Chaos Mesh (Kubernetes-native), Gremlin (commercial), LitmusChaos, Toxiproxy (network-level), Pumba (container-level). Start small, build a chaos culture incrementally.
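
Before reaching for any of those tools, the "local" step can be practised with a few lines of fault injection in a test harness. The sketch below is a hypothetical helper, not part of any tool listed above: `FaultInjector`, its parameters, and the choice of `ConnectionError` are assumptions. It fails a configurable fraction of calls and has an explicit off switch, which mirrors the abort procedure a real experiment needs.

```python
import random


class FaultInjector:
    """Wraps a callable and makes a configurable fraction of calls fail."""

    def __init__(self, fn, error_rate, seed=None, enabled=True):
        self.fn = fn
        self.error_rate = error_rate         # 0.0 = never fail, 1.0 = always fail
        self.enabled = enabled               # abort switch: set False to stop injecting
        self._rng = random.Random(seed)      # seedable for repeatable experiments
        self.injected = 0                    # how many faults were injected so far

    def __call__(self, *args, **kwargs):
        if self.enabled and self._rng.random() < self.error_rate:
            self.injected += 1
            raise ConnectionError("injected fault")
        return self.fn(*args, **kwargs)
```

Wrapping a downstream client as `FaultInjector(client.fetch, error_rate=0.5)` is enough to check locally that retries, breakers, and fallbacks behave as intended.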

Cascading Failures

The most painful production outages are cascading: a small failure in one service causes load on dependencies, which cause load on their dependencies, which exhaust resources, which cause more failures. The cycle continues until the system stops.

Defences against cascading failure:

  • Per-dependency circuit breakers to break the chain.
  • Retry budgets to cap retry amplification (Module 2).
  • Rate limits on internal RPCs to enforce backpressure.
  • Graceful degradation paths that do not depend on the failing service.
  • Load shedding: when a service is overloaded, return 503 to a percentage of requests rather than degrading all of them.
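
Of these defences, the retry budget is the least familiar, so here is a rough sketch of the idea. It is deliberately simplified: the class name `RetryBudget` and its methods are hypothetical, and it counts over the lifetime of the object, whereas a production version would track a sliding time window.

```python
class RetryBudget:
    """Allows retries only while retries stay under `ratio` of first attempts."""

    def __init__(self, ratio=0.1, min_retries=0):
        self.ratio = ratio                   # e.g. 0.1 = retries capped at 10% of requests
        self.min_retries = min_retries       # small allowance so low traffic can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Call once per original (non-retry) request."""
        self.requests += 1

    def try_spend(self):
        """Return True and count a retry if the budget allows one, else False."""
        allowed = self.min_retries + self.ratio * self.requests
        if self.retries + 1 > allowed:
            return False
        self.retries += 1
        return True
```

A caller checks `try_spend()` before each retry and gives up (or falls back) when it returns False, so a failing dependency sees a bounded increase in load rather than a multiple of it.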

Retry Amplification Visualised

Retry amplification (diagram: 3 retries per layer, no budget): Client (1k RPS) → Service A retries 3x (3k RPS) → Service B retries 3x (9k RPS) → DB melts. That is 9x amplification from two retrying layers. Each layer multiplies the load, so a third retrying layer would bring it to 3 × 3 × 3 = 27x.

Defence: a retry budget caps total retries at, say, 10% of RPS regardless of layer count. Combine it with circuit breakers and exponential backoff for full protection.
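
The arithmetic behind the diagram is easy to check. The two helper functions below are hypothetical and model the worst case only (every call fails). They read "retries 3x" as up to three attempts per incoming request, which is an assumption about the diagram's intent.

```python
def amplified_rps(client_rps, attempts_per_layer, retrying_layers):
    """Worst-case load on the bottom dependency when every call fails and each
    retrying layer makes `attempts_per_layer` attempts per incoming request."""
    return client_rps * attempts_per_layer ** retrying_layers


def budgeted_rps(client_rps, retry_ratio, retrying_layers):
    """Same worst case when each layer caps retries at `retry_ratio` of its traffic."""
    return client_rps * (1 + retry_ratio) ** retrying_layers


print(amplified_rps(1000, 3, 2))             # two retrying layers, as in the diagram
print(amplified_rps(1000, 3, 3))             # three retrying layers
print(round(budgeted_rps(1000, 0.1, 3)))     # three layers, 10% retry budget each
```

The unbudgeted load grows exponentially with depth, while a 10% budget per layer keeps three layers to roughly a third more traffic than the original.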

Self-Check Quiz

  1. Your circuit breaker is set to trip after 5 consecutive failures. Under normal load, you see brief 503 spikes that should not trip it. What do you change? (Answer: switch from a consecutive-failure count to an error rate over a window, e.g. 50% errors in the last 30s. Consecutive-failure counts are noisy.)
  2. You have separate thread pools per dependency (bulkheads). One pool exhausts. What is the rest of your service doing? (Answer: still serving traffic to other dependencies. That is the whole point of the bulkhead — isolate failure.)
  3. Why is graceful degradation hard to retrofit? (Answer: it requires identifying critical-path vs enriching dependencies and writing fallback paths. Adding it after the first incident means rewriting code under pressure.)
  4. You start a chaos experiment in production. Within 5 seconds you have a real outage. What did you skip? (Answer: practice in staging first; start small (one pod, off-peak); have an explicit abort procedure; involve on-call.)

For runtime-detection patterns that pair with these resilience controls, see the Runtime Security cheatsheet and the Falco glossary entry. The Incident Response Simulator exercises chaos and triage scenarios.

Real-World Use Cases

  • Netflix's Hystrix (now in maintenance mode) shaped the industry's circuit-breaker pattern; Resilience4j is the modern Java implementation.
  • AWS uses bulkhead isolation extensively in their internal services to contain noisy-neighbour problems.
  • Cloudflare's graceful-degradation patterns let them serve cached responses during origin outages.
  • Netflix's Chaos Monkey, which grew into the broader practice of chaos engineering, is the canonical example of running fault injection in production deliberately.

Production Notes

  • Use error-rate-over-window for circuit-breaker tripping, not consecutive-failure count. Consecutive counts trigger on noise.
  • Service meshes (Envoy/Istio/Linkerd) implement most resilience patterns at the data plane — do not rebuild them in application code if you have a mesh.
  • Identify critical-path vs enriching dependencies; treat them differently in code (critical = propagate errors, enriching = swallow with logging).
  • Run quarterly chaos game days. Untested resilience is hopeful resilience.
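
The first note, tripping on error rate over a window, can be sketched as a small sliding-window tracker. The class name `FailureRateWindow`, the defaults, and the `min_calls` guard are assumptions for illustration; library implementations differ in how they bucket and count calls.

```python
import time
from collections import deque


class FailureRateWindow:
    """Tracks call outcomes over the last `window_s` seconds."""

    def __init__(self, window_s=30.0, threshold=0.5, min_calls=20, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold           # trip at or above this failure fraction
        self.min_calls = min_calls           # do not trip on a handful of calls
        self.clock = clock
        self._events = deque()               # (timestamp, failed) pairs, oldest first

    def record(self, failed):
        self._events.append((self.clock(), bool(failed)))

    def should_trip(self):
        cutoff = self.clock() - self.window_s
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()           # drop outcomes older than the window
        total = len(self._events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, failed in self._events if failed)
        return failures / total >= self.threshold
```

Five back-to-back failures in an otherwise healthy stream stay far below a 50% threshold, which is why this condition is less sensitive to brief spikes than a consecutive count.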

Common Mistakes

  • Setting infinite retries on non-idempotent calls. One downstream blip becomes duplicate side effects everywhere.
  • Cascading retries without budgets. In a 3-hop chain where every hop tries 3 times, load on the failing service can reach 3 × 3 × 3 = 27x.
  • No fallback for “enriching” calls (recommendations, sentiment, personalisation). Their failure should NOT fail the request.
  • Chaos in production without practice in staging. The first real chaos experiment must not be your first chaos experiment.

Security Risks to Watch

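  • Circuit breakers that fail open (let requests through when the dependency is unavailable) can bypass authn / authz checks if the dependency they guard is the one performing those checks. Decide the failure mode deliberately.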
  • Chaos engineering in production without explicit access-control review can give attackers a roadmap of failure modes.
  • Graceful degradation paths often skip security checks (e.g. cache fallback returns data without re-checking permissions). Audit fallback paths.

Design Tradeoffs

Application-level resilience (Resilience4j, Polly)

Pros

  • Tight integration with code
  • Per-call control

Cons

  • Per-language implementation
  • Hard to enforce consistently across teams

Service-mesh resilience (Envoy/Istio)

Pros

  • Centralised, language-agnostic
  • Operator-controlled, no app changes

Cons

  • Sidecar latency tax
  • Operational complexity

Hybrid (mesh + app-level)

Pros

  • Right tool per scenario

Cons

  • Two layers to reason about

Production Alternatives

  • Resilience4j (Java): Modern circuit breaker / retry / bulkhead library; Hystrix successor.
  • Polly (.NET): Equivalent for .NET; mature, well-documented.
  • Service mesh resilience (Envoy/Istio/Linkerd): Centralised at the data plane; no app code changes.
  • Chaos Mesh: Kubernetes-native chaos platform; CRD-driven experiments.
  • Gremlin: Commercial chaos platform; broader cloud + non-K8s coverage.
  • LitmusChaos: Open-source chaos for Kubernetes; CNCF incubating.

Think Like an Engineer

  • For every external call ask: what is the failure mode if this call hangs? If the answer is “our entire service hangs too”, you need a timeout AND a fallback.
  • Rank your dependencies by criticality once a quarter. Move enriching calls to non-blocking; tighten circuit breakers on critical ones.
  • Run a chaos drill before every major launch. The bug it finds is almost always something nobody predicted.

Production Story

A streaming video platform's user-profile service had a rare 30-second slow-response window. Their product service called it on every page load with no timeout. During the slow window, every request to the product page hung for 30 seconds; users hit refresh, multiplying load 5x; the product service exhausted its thread pool; the entire site went down. Root cause: missing timeout. Fix: add 200ms timeout + fallback to cached profile. The fix was 5 lines of code; the outage was 47 minutes.
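
The shape of that fix, a short timeout with a cached fallback, might look like the sketch below. It is not the platform's actual code: `get_profile`, the in-memory `_profile_cache`, the anonymous default, and the pool size are all assumptions. One caveat is called out in the comments: the timed-out call keeps running on its worker thread, so the pool size acts as a bulkhead on how many slow calls can pile up.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=8)    # dedicated pool for profile calls
_profile_cache = {}                           # last known good profile per user


def get_profile(user_id, fetch_profile, timeout_s=0.2):
    """Fetch a profile, waiting at most timeout_s; fall back to the cached copy."""
    future = _pool.submit(fetch_profile, user_id)
    try:
        profile = future.result(timeout=timeout_s)
    except (FutureTimeout, ConnectionError):
        # cancel() only helps if the call has not started; a running call
        # continues in the background until the dependency responds.
        future.cancel()
        return _profile_cache.get(user_id, {"user_id": user_id, "anonymous": True})
    _profile_cache[user_id] = profile         # refresh the fallback on every success
    return profile
```

The page load now waits at most `timeout_s` for the profile service, whatever that service is doing.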

Key Terms

Circuit breaker
State machine that stops calling a failing dependency to prevent cascading failure.
Bulkhead
Resource isolation pattern; separate pools per dependency to contain failure.
Graceful degradation
Returning reduced functionality when a non-critical dependency fails.
Chaos engineering
Discipline of injecting failures into production-like systems to validate resilience.
Retry budget
Cap on retry traffic as a fraction of total RPS; prevents retry storms.

Hands-On Labs

  1. Lab 7.1 — Circuit Breaker in Action

    Implement a circuit breaker (Resilience4j or hand-rolled), trigger failure modes, validate state transitions.

    60 minutes - Intermediate

    • Wrap a flaky downstream call with a circuit breaker
    • Inject 50% error rate; observe breaker trip to OPEN
    • Wait for timeout; observe HALF-OPEN; success returns to CLOSED
    • Compare with no breaker: caller threads exhaust

    View lab files on GitHub

  2. Lab 7.2 — Chaos Mesh on Kubernetes

    Run chaos experiments on a kind cluster and observe how the application reacts.

    120 minutes - Advanced

    • Install Chaos Mesh on a kind cluster
    • Inject pod-kill, network-loss, CPU stress experiments
    • Verify HPA, retry policies, circuit breakers all work
    • Document a chaos game-day runbook

    View lab files on GitHub

  3. Lab 7.3 — Graceful Degradation Architecture

    Refactor a service with multiple dependencies to degrade gracefully under partial failure.

    60 minutes - Intermediate

    • Identify critical vs enriching dependencies
    • Add fallback responses for enriching dependencies
    • Inject failures and verify reduced-but-valid responses
    • Compare to baseline (full error on any dependency failure)

    View lab files on GitHub

Key Takeaways

  • Reliability is engineered with timeouts, retries, circuit breakers, bulkheads, and graceful degradation — not hoped for
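  • Service meshes (Envoy / Istio) implement timeouts, retries, and circuit breaking at the data plane with no application changes; use them where you have one, and keep fallback logic in application code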
  • Identify critical-path vs enriching dependencies and treat them differently in code
  • Chaos engineering is the only way to know your resilience patterns actually work
  • Cascading failures kill systems; break the chain with circuit breakers and retry budgets