Module 12 of 12

Production System Design & Capstone

Multi-region architecture, disaster recovery, capacity planning, real production tradeoffs — and a capstone that integrates every module into one system.

6 hours · 3 labs · Free

Start here

Learning objectives

  • Design a multi-region architecture and reason about its failure modes
  • Plan disaster recovery: RPO, RTO, runbooks, tested restoration
  • Design for cost — capacity reservations, autoscaling, regional placement
  • Read production architecture diagrams (Kafka, Kubernetes, Netflix, Uber, Google) and identify the trade-offs
  • Design a complete production system as the capstone and defend the choices

Before

  • DR documented but never tested; first real failover is the test
  • Single region; an AZ failure becomes a customer-impacting outage
  • Cost grows linearly with users; no architectural levers pulled
  • Capstone-level system design treated as senior interview hazing, not real skill

After

  • DR drilled quarterly; actual RTO measured and improved
  • Multi-region with explicit pattern (active-passive or sharded active-active)
  • Cost engineering as ongoing discipline: right-sizing, spot, reservations
  • Production architecture documented with defended trade-offs; portfolio-quality artefact

[Diagram: Multi-region active-active. Two regions (us-east-1 and eu-west-1) each run app, db, and cache behind a global LB (Route 53 / GCLB), linked by cross-region replication and CDC for invalidation.]
Each region serves local traffic. Async cross-region replication for shared state. Failover via DNS. RPO bounded by replication lag. RTO bounded by health-check + DNS TTL.

The capstone module ties every prior topic together. Production system design is the test of whether you can make CAP, replication, scaling, reliability, security, and observability cohere into a single architecture. Most engineers can name the components; few can defend the choices.

Multi-Region Architecture

Three patterns:

  • Active-Passive: writes go to one region; standby region is a hot replica. Simple consistency story; failover is a deliberate, observable operation. RPO bounded by replication lag.
  • Active-Active sharded: each region owns a shard of the keyspace. Reads and writes local to that region. Cross-region traffic only for cross-shard operations. The right pattern for most globally-distributed systems.
  • Active-Active replicated: same data in every region; conflict resolution required. Useful for read-heavy global content (CDN-cached pages, profile reads). Dangerous for writes-with-consequence (payments) without globally-consistent storage like Spanner.
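
To make the sharded active-active pattern concrete, here is a minimal sketch of key-to-region ownership. The region list, the function names, and the hash-modulo scheme are all assumptions for illustration; a production system would more likely use an explicit shard map or consistent hashing so that adding a region does not reshuffle every key.

```python
import hashlib

# Hypothetical region list; a real system would load this from configuration.
REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]


def home_region(key: str, regions: list[str] = REGIONS) -> str:
    """Map a key to the single region that owns its writes."""
    # Use a stable hash (not Python's per-process randomised hash()) so
    # every node in every region computes the same owner.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


def route_write(key: str, local_region: str) -> str:
    """Return 'local' if this region owns the key, else the region to forward to."""
    owner = home_region(key)
    return "local" if owner == local_region else owner
```

Because exactly one region accepts writes for a given key, two regions never produce conflicting versions of the same record; the price is a cross-region hop whenever a request lands in a region that does not own the key.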

Disaster Recovery

Two metrics that drive every DR design:

  • RPO (Recovery Point Objective): how much data you can afford to lose. Driven by replication lag and backup cadence. RPO=0 requires synchronous cross-region replication, which adds write latency and cost. RPO=1 hour means hourly backups suffice.
  • RTO (Recovery Time Objective): how long the recovery can take. Driven by failover automation, DNS TTL, warm-standby vs cold-restoration. RTO of minutes requires hot standbys; RTO of hours allows for backup-restore.
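
The two metrics can be turned into a rough budget before any drill is run. The sketch below is an assumption-laden estimate, not a standard formula: it treats worst-case RTO as detection time plus DNS TTL plus database promotion time, and worst-case RPO as the async replication lag at the moment of failure. The class, field names, and example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailoverBudget:
    health_check_interval_s: float   # how often the global LB probes the region
    failures_before_unhealthy: int   # consecutive failed probes before failover
    dns_ttl_s: float                 # clients may cache the old record this long
    promotion_s: float               # time to promote the standby database

    def worst_case_rto_s(self) -> float:
        detection = self.health_check_interval_s * self.failures_before_unhealthy
        return detection + self.dns_ttl_s + self.promotion_s


def worst_case_rpo_s(replication_lag_s: float) -> float:
    # With async replication, writes not yet shipped to the standby are lost.
    return replication_lag_s


# Assumed example values: 10 s probes, 3 failures, 60 s TTL, 120 s promotion.
budget = FailoverBudget(10, 3, 60, 120)
print(f"worst-case RTO: {budget.worst_case_rto_s():.0f}s")
```

An estimate like this is only a starting point; the measured RTO from a drill is the number that counts.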

The DR test discipline: untested DR is theatre. Run a DR drill every quarter: simulate a regional outage, fail over, measure actual RTO, restore, document gaps.

Capacity Planning

Capacity engineering is forecasting load and provisioning to meet it without over-spending. The discipline:

  • Track per-service RPS, p99 latency, resource utilisation over time.
  • Project growth from product roadmap and historic curves.
  • Identify bottleneck per service (CPU, RAM, connection pool, disk IOPS, downstream RPC).
  • Reserve capacity for traffic peaks (Black Friday, marketing events).
  • Set autoscaling boundaries that avoid surprises (HPA max not too low, not infinite).
  • Review actuals against the forecast monthly; re-project the forecast quarterly.
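
The steps above can be sketched as a small calculation: project load with compound growth, convert peak load into replicas at a target utilisation, and derive autoscaling bounds. The function names, the 60% utilisation target, the 1.5x safety factor, and the floor of two replicas are illustrative assumptions, not recommendations.

```python
import math


def projected_rps(current_rps: float, monthly_growth: float, months: int) -> float:
    """Compound growth projection, e.g. monthly_growth=0.10 for 10% per month."""
    return current_rps * (1 + monthly_growth) ** months


def replicas_needed(rps: float, rps_per_replica: float,
                    target_utilisation: float = 0.6) -> int:
    """Replicas required to serve rps while keeping each below the target."""
    return math.ceil(rps / (rps_per_replica * target_utilisation))


def hpa_bounds(steady_rps: float, peak_rps: float, rps_per_replica: float,
               safety_factor: float = 1.5) -> tuple[int, int]:
    """Suggest (min, max) replicas: min covers steady state, max covers
    the forecast peak plus headroom, so the ceiling is neither too low
    nor unbounded."""
    min_replicas = max(replicas_needed(steady_rps, rps_per_replica), 2)
    max_replicas = math.ceil(replicas_needed(peak_rps, rps_per_replica) * safety_factor)
    return min_replicas, max_replicas
```

For example, `hpa_bounds(500, 2000, 100)` sizes a service that handles 100 RPS per replica for a 500 RPS steady state and a 2000 RPS peak.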

Cost Engineering

At scale, cloud spend becomes architectural. Three high-leverage levers:

  • Right-sizing: VPA recommendations or manual analysis. Most workloads request 2–3x what they use; cutting that is direct savings.
  • Spot / preemptible instances: typically 60–90% cheaper than on-demand, but reclaimable at short notice. Use for batch, async, stateless web. On Kubernetes, a node autoscaler such as Karpenter can absorb the eviction churn.
  • Reserved capacity / savings plans: 30–60% cheaper for committed baseline. Buy enough to cover the steady-state, on-demand for the rest.
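
A quick way to see how these levers combine is to price the same fleet two ways. The sketch below is a simplified model under stated assumptions: a single on-demand unit price, a 40% reserved discount, and a 70% spot discount (both picked from inside the ranges quoted above). Real pricing varies by provider, instance type, and commitment term.

```python
def blended_monthly_cost(baseline_units: float, burst_units: float,
                         batch_units: float, on_demand_price: float,
                         reserved_discount: float = 0.4,
                         spot_discount: float = 0.7) -> float:
    """Baseline on reserved capacity, burst on on-demand, batch on spot."""
    reserved = baseline_units * on_demand_price * (1 - reserved_discount)
    on_demand = burst_units * on_demand_price
    spot = batch_units * on_demand_price * (1 - spot_discount)
    return reserved + on_demand + spot


def all_on_demand_cost(baseline_units: float, burst_units: float,
                       batch_units: float, on_demand_price: float) -> float:
    """The same fleet with no commitments and no spot usage."""
    return (baseline_units + burst_units + batch_units) * on_demand_price
```

With 100 baseline units, 20 burst units, and 50 batch units at a price of 10 per unit, the blended model comes to roughly 950 against 1700 all on-demand, under the assumed discounts.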

Production Architecture Case Studies

Read four real production architectures and identify how each one made the decisions covered in this course:

  • Kafka at LinkedIn: trillions of events per day; partitioning by key, ZooKeeper (now KRaft) for metadata, MirrorMaker for cross-region.
  • Cassandra at Netflix: leaderless replication, multi-region with LOCAL_QUORUM, custom backup tooling, the original Chaos Monkey.
  • Uber's ringpop: SWIM gossip + consistent hashing for service partitioning.
  • Spanner at Google: globally-consistent SQL via TrueTime; the gold standard for multi-region strong consistency.
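
Consistent hashing, mentioned above for service partitioning, is worth seeing in miniature. The following is a generic textbook sketch with virtual nodes, not ringpop's actual implementation; the class name, the SHA-256 choice, and the default of 64 virtual nodes are assumptions.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 64):
        # Each node is placed at many points so load spreads evenly.
        self._ring = sorted(
            (_point(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, key: str) -> str:
        # The first ring point clockwise from the key's position owns it;
        # the modulo wraps around past the last point.
        idx = bisect.bisect(self._points, _point(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that matters operationally: when a node joins, the only keys that change owner are the ones the new node takes over, so the rest of the cluster keeps its cache and state.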

Capstone Project

Design and document a complete production distributed system. Your capstone should include:

  1. An architecture diagram showing every service, datastore, and external dependency.
  2. The communication choice per boundary (sync HTTP, async Kafka, mTLS, etc.).
  3. The data model and partitioning strategy per datastore.
  4. The replication strategy and consistency guarantees.
  5. The autoscaling and capacity plan.
  6. The reliability patterns (circuit breakers, timeouts, retries, degradation).
  7. The security architecture (workload identity, mTLS, authz).
  8. The observability stack (traces, metrics, logs, SLOs).
  9. The deployment architecture (CI/CD, regions, rollback strategy).
  10. The DR plan with RPO/RTO targets.

Defending the choices is the test. For each component you must be able to explain: why this and not the alternative. The answer is rarely the same for two systems — that is the discipline of system design.

Disaster Recovery Topology

[Diagram: DR topology, active-passive with backup. Primary region: app tier (active), DB primary, backup snapshot. Secondary region (passive): app tier (warm standby), DB read replica, copy restored from snapshot. Links: async replication (RPO bound); snapshot copy (hourly).]
Failover: DNS / global LB redirects to secondary. RTO ≈ minutes. RPO ≈ replication lag.

Global Traffic Routing

[Diagram: Global traffic routing (latency / geo-based). Users in the EU, US, and APAC reach a global LB (Route 53 / GCLB / Anycast), which applies latency-based routing to the eu-west-1, us-east-1, and ap-southeast-1 regions.]
Health-check failover routes around dead regions; latency-based routing keeps users on the closest healthy region.
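
The routing decision itself is simple enough to sketch: among healthy regions, choose the one with the lowest measured latency. This is an illustrative model of the policy only; the function name and inputs are hypothetical, and managed services such as Route 53 implement the policy inside DNS rather than exposing a function like this.

```python
def pick_region(latency_ms: dict[str, float], healthy: set[str]) -> str:
    """Return the lowest-latency region that is currently passing health checks."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Assumed latencies as seen by a user in Europe.
eu_user = {"eu-west-1": 20.0, "us-east-1": 95.0, "ap-southeast-1": 180.0}
```

With every region healthy the European user lands on eu-west-1; remove it from the healthy set and the same call returns us-east-1, which is the failover behaviour described above.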

Self-Check Quiz

  1. Your business says “RPO=0, RTO=5 minutes” for the database. What does this require? (Answer: synchronous cross-region replication (or globally-consistent storage like Spanner) and automated failover. Both are expensive. Understand the cost before committing.)
  2. You design active-active across two regions for a payment system. Why is that risky? (Answer: split-brain on writes during partition. Active-active for payments needs sharded ownership (each shard active in one region) or globally-consistent storage.)
  3. Your DR drill takes 3 hours instead of the planned 30 minutes. What were the gaps? (Answer: typically secrets restoration, DNS propagation, dependency-order startup, or a missing runbook for one component. Drills find these. Untested DR is theatre.)
  4. Your monthly cloud bill grows 30% in a quarter. Two engineers spend a week analysing. What three levers usually pay back? (Answer: right-sizing requests, spot/preemptible for batch, reserved/savings plans for baseline. Each is 30-90% savings on the relevant bucket.)
  5. Why is the capstone exercise more valuable than another module of content? (Answer: production system design is a synthesis skill that transfers only with practice. The capstone forces you to apply every prior module's trade-offs to a single coherent architecture.)

Where to Go Next — Future Advanced Courses

This course gives you the foundations and operational fluency. Three directions for deeper specialisation, each available free on CodersSecret:

  • Mastering SPIFFE & SPIRE — 13 modules going deep on workload identity. The right next course after Module 8 (Distributed Security & Zero Trust) of this course.
  • Cloud Native Security Engineering — 16 modules on Kubernetes-native security. The right next course after Modules 8 and 10 of this course.
  • Production RAG Systems Engineering — 16 modules on AI-infrastructure-specific distributed systems patterns. The right next course if your distributed systems work is in AI/ML production.

For ongoing operational reference, the cheatsheets that align with this course: Kubernetes, Kubernetes Security, SPIFFE/SPIRE, OPA/Rego, API Security, Runtime Security, Service Mesh, DevSecOps.

Real world

Where this shows up

  • Stripe runs active-passive multi-region for payment processing with sub-minute failover.
  • Google Spanner is the canonical example of globally-consistent SQL with TrueTime-bounded uncertainty.
  • Netflix runs active-active across AWS regions for the streaming control plane (read-heavy, eventually consistent).
  • AWS DynamoDB Global Tables provide multi-region active-active with last-writer-wins; useful for read-heavy global content.
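
Last-writer-wins is easy to state and easy to underestimate. The sketch below is a generic LWW register, not the DynamoDB implementation; the type, the timestamps, and the account scenario are assumptions chosen to show why the pattern suits content-style data and not balances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: int
    timestamp: float
    region: str


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    # Higher timestamp wins; region name breaks ties deterministically
    # so both sides pick the same winner.
    return max(a, b, key=lambda v: (v.timestamp, v.region))


# Both regions start from a balance of 100 and apply a debit during a partition.
us = Versioned(value=100 - 30, timestamp=10.0, region="us-east-1")
eu = Versioned(value=100 - 50, timestamp=10.5, region="eu-west-1")

merged = lww_merge(us, eu)
print(merged.value)  # 50: the debit of 30 is silently lost (both debits give 20)
```

Both regions converge on the same value, which is all LWW promises. For a profile photo that is fine; for money it is a lost write.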

Production notes

Keep these close

  • Practice DR drills quarterly. Untested DR is theatre. Measure actual RTO; gap-fill before the real outage.
  • Capacity reviews monthly; full re-projection quarterly. Growth surprises do not have to be surprises.
  • Tag every cloud resource with cost-centre + service. Cost engineering needs attribution.
  • Buy reserved capacity for the steady-state baseline; on-demand for the burst; spot for batch.

Common mistakes

What usually breaks

  • Designing for “active-active multi-region” for write-heavy workloads without globally-consistent storage. Split brain on payments is catastrophic.
  • Confusing RPO and RTO. RPO is data loss tolerance; RTO is recovery time tolerance. Both need explicit SLOs.
  • Right-sizing once and forgetting. Workload behaviour drifts; right-sizing is a quarterly discipline.
  • No DR runbook for the dependency graph. Bringing services back in random order risks cascading retries on cold backends.
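
A dependency-ordered startup plan can be derived mechanically from the service graph. This sketch uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the service names and their dependencies are a hypothetical example.

```python
from graphlib import TopologicalSorter

# Hypothetical service graph: each service maps to the services it depends on.
DEPENDS_ON = {
    "postgres": set(),
    "kafka": set(),
    "auth": {"postgres"},
    "orders": {"postgres", "kafka", "auth"},
    "gateway": {"auth", "orders"},
}


def startup_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group services into waves; everything in a wave can start in parallel
    once all earlier waves are healthy."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(startup_waves(DEPENDS_ON), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Writing the waves into the runbook, with a health gate between them, avoids starting a gateway that immediately retries against a datastore that is still cold.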

Security risks

Threats to watch

  • Multi-region implies cross-region replication; the wire is a new attack surface. Always TLS, ideally mTLS via SPIFFE federation.
  • DR backups in cold storage are still customer data; encrypt at rest with KMS, audit access.
  • Failover automation that bypasses change control becomes the attacker's lever (“simulate failure to force failover”).
  • Active-active with shared global IAM means one region's compromise = global compromise. Region-scoped credentials are safer.

Tradeoffs

Design choices you should be able to defend

Active-Passive multi-region

Pros

  • Simple consistency story
  • Bounded cost
  • Tested failover path

Cons

  • Standby capacity is “wasted”
  • Failover is an operation

Active-Active sharded

Pros

  • Each region serves local traffic
  • Clear ownership boundary

Cons

  • Cross-shard ops are expensive
  • Complex routing

Active-Active replicated (full data in every region)

Pros

  • Read locality everywhere

Cons

  • Conflict resolution required
  • Dangerous for writes-with-consequence without consistent storage

Globally-consistent (Spanner / CockroachDB)

Pros

  • Linearizable globally
  • Writes anywhere

Cons

  • Premium cost
  • Cross-region commit latency

Alternatives

Other production approaches

Active-Passive multi-region

Simplest, bounded RTO/RPO; standby capacity costs.

Active-Active sharded

Each region owns a shard; standard pattern for global SaaS.

Active-Active replicated

Same data in every region; requires conflict resolution, and is safe for consequential writes only with globally-consistent storage.

Edge-first (Cloudflare Workers, Lambda@Edge)

Compute at the edge; lowest latency for global users.

Multi-cloud (avoid vendor lock-in)

Highest operational cost; pays back if vendor risk is real.

Think like an engineer

Questions to answer before shipping

  • Architecture is the discipline of saying no. Every “yes” to a feature locks in trade-offs that constrain the next year of decisions.
  • For every architectural choice, write down the alternative. If you cannot articulate the alternative, you do not understand your own choice.
  • When defending an architecture, frame it as “I chose X because the trade-off was Y vs Z; here is which I optimised for and why”. That framing is the difference between “senior” and “staff”.

Key terms

Vocabulary used in this module

RPO

Recovery Point Objective; how much data you can afford to lose in a disaster.

RTO

Recovery Time Objective; how long recovery can take.

Active-Active

Multiple regions accept reads and writes simultaneously.

Active-Passive

One region is primary; others are standby replicas activated only on failover.

Capacity planning

Discipline of forecasting load and provisioning to meet it efficiently.

Labs

Hands-on labs

120 minutes · Advanced

Lab 12.1 — Multi-Region Active-Active Demo

Stand up a small active-active app across two regions; observe failover.

  1. Deploy app + DB in two simulated regions (kind clusters)
  2. Configure CRDT-style or LWW conflict resolution
  3. Simulate partition; observe behaviour
  4. Heal partition; verify convergence
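
For step 2, one CRDT simple enough to write by hand is a grow-only counter (G-Counter). The sketch below is a minimal in-memory version for experimenting with the partition and heal steps; it is an assumption about what a "CRDT-style" option could look like, not the lab's reference solution.

```python
class GCounter:
    """Grow-only counter: one slot per region, merged by element-wise max."""

    def __init__(self, region: str):
        self.region = region
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        # A region only ever writes to its own slot.
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Max is commutative, associative, and idempotent, so replicas
        # converge regardless of merge order or repeated delivery.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


# During the partition, each region counts independently.
us, eu = GCounter("us-east-1"), GCounter("eu-west-1")
us.increment(3)
eu.increment(2)

# Heal: exchange state in both directions.
us.merge(eu)
eu.merge(us)
print(us.value(), eu.value())  # 5 5
```

Unlike last-writer-wins, no increment is discarded on merge, which is the convergence property step 4 asks you to verify.
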
View lab on GitHub
120 minutes · Advanced

Lab 12.2 — DR Drill

Practice a full disaster recovery drill from backup.

  1. Take a snapshot of a stateful service
  2. Destroy the cluster
  3. Restore from snapshot to a fresh cluster
  4. Measure actual RTO; identify gaps
View lab on GitHub
4 hours · Advanced

Lab 12.3 — Capstone Architecture Document

Produce a complete production architecture for a distributed system of your choosing.

  1. Pick a domain (e-commerce, payments, analytics, social)
  2. Document architecture per the 10-point list above
  3. Defend each choice with an alternative + the trade-off you made
  4. Submit to peer review
View lab on GitHub

Recap

Key takeaways

  • Multi-region is hard — pick active-passive or active-active sharded for most workloads
  • RPO and RTO drive every DR decision; untested DR is theatre
  • Capacity planning is forecasting + bottleneck analysis + autoscaling discipline
  • Cost engineering is architectural at scale; right-size, spot, reserved capacity
  • The capstone is the test: can you defend every architectural choice with the trade-off?
