Module 12: Production System Design & Capstone

Multi-region architecture, disaster recovery, capacity planning, real production tradeoffs — and a capstone that integrates every module into one system.

6 hours. 3 hands-on labs. Free course module.

Learning Objectives

  • Design a multi-region architecture and reason about its failure modes
  • Plan disaster recovery: RPO, RTO, runbooks, tested restoration
  • Design for cost — capacity reservations, autoscaling, regional placement
  • Read production architecture diagrams (Kafka, Kubernetes, Netflix, Uber, Google) and identify the trade-offs
  • Design a complete production system as the capstone and defend the choices

Why This Matters

This is the module that proves you can do system design at the senior / staff level. Anyone can list components; the engineer who can defend the choices (explain why this database and not that one, this consistency model and not the next, this region pattern and not the alternative) is the engineer who gets trusted with the architecture role. The capstone is your portfolio.

[Diagram: Multi-region active-active. A global LB (Route 53 / GCLB) fronts us-east-1 and eu-west-1, each running app, db, and cache, linked by cross-region replication with CDC for invalidation. Each region serves local traffic. Async cross-region replication for shared state. Failover via DNS. RPO bounded by replication lag. RTO bounded by health-check + DNS TTL.]
Architecture diagram for Module 12: Production System Design & Capstone.

Lesson Content

The capstone module ties every prior topic together. Production system design is the test of whether you can make CAP, replication, scaling, reliability, security, and observability cohere in a single architecture. Most engineers can name the components; few can defend the choices.

Multi-Region Architecture

Three patterns:

  • Active-Passive: writes go to one region; standby region is a hot replica. Simple consistency story; failover is a deliberate, observable operation. RPO bounded by replication lag.
  • Active-Active sharded: each region owns a shard of the keyspace. Reads and writes local to that region. Cross-region traffic only for cross-shard operations. The right pattern for most globally-distributed systems.
  • Active-Active replicated: same data in every region; conflict resolution required. Useful for read-heavy global content (CDN-cached pages, profile reads). Dangerous for writes-with-consequence (payments) without globally-consistent storage like Spanner.
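
The ownership rule behind the active-active sharded pattern can be sketched in a few lines. This is a minimal illustration, not a production router: the region names, the `home_region` and `route_write` helpers, and the hash-modulo placement are all assumptions made for the example. Real systems usually keep an explicit shard map so that adding a region does not reshuffle ownership.

```python
import hashlib

# Hypothetical region list; a real deployment would load this from config.
REGIONS = ["us-east-1", "eu-west-1"]

def home_region(key: str, regions: list[str] = REGIONS) -> str:
    """Map a key to the single region that owns writes for it."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]

def route_write(key: str, local_region: str) -> str:
    """Return 'local' if this region owns the key, else where to forward."""
    owner = home_region(key)
    return "local" if owner == local_region else f"forward:{owner}"
```

Because every key has exactly one writing region, two regions never accept conflicting writes for the same key; the cost is a cross-region hop whenever a request lands in a region that does not own the key.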

Disaster Recovery

Two metrics that drive every DR design:

  • RPO (Recovery Point Objective): how much data you can afford to lose. Driven by replication lag and backup cadence. RPO=0 requires synchronous cross-region replication (cost). RPO=1 hour means hourly backups suffice.
  • RTO (Recovery Time Objective): how long the recovery can take. Driven by failover automation, DNS TTL, warm-standby vs cold-restoration. RTO of minutes requires hot standbys; RTO of hours allows for backup-restore.
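
As a rough worked example of how these bounds add up, the sketch below estimates worst-case RTO and RPO for a DNS-based failover. The `FailoverConfig` fields and the simple additive model (detection time plus DNS TTL plus promotion time) are assumptions for illustration; a measured drill is the only number to trust.

```python
from dataclasses import dataclass

@dataclass
class FailoverConfig:
    health_check_interval_s: float  # seconds between health checks
    failure_threshold: int          # consecutive failures before failover
    dns_ttl_s: float                # clients may cache the old record this long
    promotion_s: float              # time to promote the standby database
    replication_lag_s: float        # typical async replication lag

def worst_case_rto_s(cfg: FailoverConfig) -> float:
    """Detection + DNS cache expiry + standby promotion."""
    detection = cfg.health_check_interval_s * cfg.failure_threshold
    return detection + cfg.dns_ttl_s + cfg.promotion_s

def worst_case_rpo_s(cfg: FailoverConfig) -> float:
    """With async replication, writes inside the lag window can be lost."""
    return cfg.replication_lag_s

def meets_targets(cfg: FailoverConfig, rto_target_s: float, rpo_target_s: float) -> bool:
    return (worst_case_rto_s(cfg) <= rto_target_s
            and worst_case_rpo_s(cfg) <= rpo_target_s)

cfg = FailoverConfig(10, 3, 60, 120, 5)
print(worst_case_rto_s(cfg))        # 30 + 60 + 120 = 210 seconds
print(meets_targets(cfg, 300, 0))   # False: async lag cannot satisfy RPO=0
```

The second call shows why RPO=0 forces synchronous replication: any non-zero lag fails the target regardless of how fast failover is.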

The DR test discipline: untested DR is theatre. Run a quarterly DR drill: simulate a regional outage, fail over, measure actual RTO, restore, document gaps.

Capacity Planning

Capacity engineering is forecasting load and provisioning to meet it without over-spending. The discipline:

  • Track per-service RPS, p99 latency, resource utilisation over time.
  • Project growth from product roadmap and historic curves.
  • Identify bottleneck per service (CPU, RAM, connection pool, disk IOPS, downstream RPC).
  • Reserve capacity for traffic peaks (Black Friday, marketing events).
  • Set autoscaling boundaries that avoid surprises (HPA max not too low, not infinite).
  • Review monthly; update quarterly.
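
The forecasting steps above can be sketched as simple arithmetic. The function names, the compound-growth model, and the 60% target utilisation are assumptions for illustration, not a recommended standard; real forecasts should be fitted to the historic curves.

```python
import math

def projected_peak_rps(current_peak_rps: float, monthly_growth: float,
                       months: int, event_multiplier: float = 1.0) -> float:
    """Compound monthly growth, then scale for a known peak event."""
    return current_peak_rps * (1 + monthly_growth) ** months * event_multiplier

def replicas_needed(peak_rps: float, rps_per_replica: float,
                    target_utilisation: float = 0.6) -> int:
    """Replicas required to serve peak_rps while staying under target utilisation."""
    return math.ceil(peak_rps / (rps_per_replica * target_utilisation))

def hpa_max_is_sufficient(replicas: int, hpa_max: int) -> bool:
    """Flag an autoscaler ceiling that the forecast would exceed."""
    return replicas <= hpa_max

# 1,000 RPS today, 10% monthly growth, 6 months out, 2x event peak.
peak = projected_peak_rps(1000, 0.10, 6, event_multiplier=2.0)
needed = replicas_needed(peak, rps_per_replica=100)
print(round(peak), needed, hpa_max_is_sufficient(needed, hpa_max=40))
```

In this example the forecast needs about 60 replicas, so an HPA maximum of 40 would be the surprise the autoscaling bullet warns about.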

Cost Engineering

At scale, cloud spend becomes architectural. Three high-leverage levers:

  • Right-sizing: VPA recommendations or manual analysis. Most workloads request 2–3x what they use; cutting that is direct savings.
  • Spot / preemptible instances: 60–90% cheaper than on-demand. Use for batch, async, stateless web. Karpenter handles the eviction churn.
  • Reserved capacity / savings plans: 30–60% cheaper for committed baseline. Buy enough to cover the steady state, and use on-demand for the rest.
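
To see how the three levers combine, here is a small blended-cost calculation. The unit counts, the price, and the discount rates are hypothetical inputs chosen from within the ranges quoted above; actual discounts depend on provider, instance type, and commitment term.

```python
def blended_hourly_cost(baseline_units: float, burst_units: float,
                        batch_units: float, on_demand_price: float,
                        reserved_discount: float, spot_discount: float) -> float:
    """Baseline on reserved pricing, burst on on-demand, batch on spot."""
    reserved = baseline_units * on_demand_price * (1 - reserved_discount)
    burst = burst_units * on_demand_price
    spot = batch_units * on_demand_price * (1 - spot_discount)
    return reserved + burst + spot

# Hypothetical fleet: 100 baseline, 20 burst, 50 batch units at $1.00/hour.
all_on_demand = (100 + 20 + 50) * 1.00
blended = blended_hourly_cost(100, 20, 50, 1.00,
                              reserved_discount=0.40, spot_discount=0.70)
print(f"{blended:.2f} vs {all_on_demand:.2f}")
print(f"saving {1 - blended / all_on_demand:.0%}")
```

The calculation only works if spend can be attributed to a bucket (baseline, burst, batch) in the first place, which is why resource tagging matters.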

Production Architecture Case Studies

Read four real production architectures and identify how they made every decision in this course:

  • Kafka at LinkedIn: trillions of events per day; partitioning by key, ZooKeeper (now KRaft) for metadata, MirrorMaker for cross-region.
  • Cassandra at Netflix: leaderless replication, multi-region with LOCAL_QUORUM, custom backup tooling, the original Chaos Monkey.
  • Uber's ringpop: SWIM gossip + consistent hashing for service partitioning.
  • Spanner at Google: globally-consistent SQL via TrueTime; the gold standard for multi-region strong consistency.
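
Consistent hashing, mentioned in the ringpop case study, is easy to sketch. This is a generic hash ring with virtual nodes, written for illustration only; it is not ringpop's implementation, and the `HashRing` class and its parameters are hypothetical.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes: list[str], vnodes: int = 64):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

    def owner(self, key: str) -> str:
        """The first ring point clockwise from the key's hash owns the key."""
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that matters operationally: when a node leaves, only the keys it owned move. Keys owned by the surviving nodes keep the same owner, which limits cache churn and rebalancing traffic during membership changes.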

Capstone Project

Design and document a complete production distributed system. Your capstone should include:

  1. An architecture diagram showing every service, datastore, and external dependency.
  2. The communication choice per boundary (sync HTTP, async Kafka, mTLS, etc.).
  3. The data model and partitioning strategy per datastore.
  4. The replication strategy and consistency guarantees.
  5. The autoscaling and capacity plan.
  6. The reliability patterns (circuit breakers, timeouts, retries, degradation).
  7. The security architecture (workload identity, mTLS, authz).
  8. The observability stack (traces, metrics, logs, SLOs).
  9. The deployment architecture (CI/CD, regions, rollback strategy).
  10. The DR plan with RPO/RTO targets.

Defending the choices is the test. For each component you must be able to explain: why this and not the alternative. The answer is rarely the same for two systems — that is the discipline of system design.

Disaster Recovery Topology

[Diagram: DR topology, active-passive with backup. Primary region: app tier (active), DB primary, backup snapshot. Secondary region (passive): app tier (warm standby), DB read replica, restored from snapshot. Links: async replication (RPO bound) and snapshot copy (hourly). Failover: DNS / global LB redirects to secondary. RTO ≈ minutes. RPO ≈ replication lag.]

Global Traffic Routing

[Diagram: Global traffic routing (latency / geo-based). Users in EU, US, and APAC reach a global LB (Route 53 / GCLB / Anycast) that applies latency-based routing to the eu-west-1, us-east-1, and ap-southeast-1 regions. Health-check failover routes around dead regions; latency-based routing keeps users on the closest healthy region.]
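
The routing decision described here reduces to "lowest latency among healthy regions". The sketch below shows that rule in isolation; the `pick_region` helper and the latency figures are hypothetical, and real global load balancers make this decision from their own measurements and health checks.

```python
def pick_region(latency_ms: dict[str, float], healthy: set[str]) -> str:
    """Choose the lowest-latency region that is currently passing health checks."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)

# Hypothetical latencies measured from a user in Europe.
eu_user = {"eu-west-1": 20.0, "us-east-1": 90.0, "ap-southeast-1": 180.0}

print(pick_region(eu_user, {"eu-west-1", "us-east-1", "ap-southeast-1"}))  # eu-west-1
print(pick_region(eu_user, {"us-east-1", "ap-southeast-1"}))               # us-east-1
```

The second call is the failover case: with eu-west-1 failing health checks, the user is routed to the next-closest healthy region.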

Self-Check Quiz

  1. Your business says “RPO=0, RTO=5 minutes” for the database. What does this require? (Answer: synchronous cross-region replication (or globally-consistent storage like Spanner) and automated failover. Both are expensive. Understand the cost before committing.)
  2. You design active-active across two regions for a payment system. Why is that risky? (Answer: split-brain on writes during partition. Active-active for payments needs sharded ownership (each shard active in one region) or globally-consistent storage.)
  3. Your DR drill takes 3 hours instead of the planned 30 minutes. What were the gaps? (Answer: typically secrets restoration, DNS propagation, dependency-order startup, or a missing runbook for one component. Drills find these. Untested DR is theatre.)
  4. Your monthly cloud bill grows 30% in a quarter. Two engineers spend a week analysing. What three levers usually pay back? (Answer: right-sizing requests, spot/preemptible for batch, reserved/savings plans for baseline. Each is 30-90% savings on the relevant bucket.)
  5. Why is the capstone exercise more valuable than another module of content? (Answer: production system design is a synthesis skill that transfers only with practice. The capstone forces you to apply every prior module's trade-offs to a single coherent architecture.)

Where to Go Next — Future Advanced Courses

This course gives you the foundations and operational fluency. Three directions for deeper specialisation, each available free on CodersSecret:

  • Mastering SPIFFE & SPIRE — 13 modules going deep on workload identity. The right next course after Module 8 (Distributed Security & Zero Trust) of this course.
  • Cloud Native Security Engineering — 16 modules on Kubernetes-native security. The right next course after Modules 8 and 10 of this course.
  • Production RAG Systems Engineering — 16 modules on AI-infrastructure-specific distributed systems patterns. The right next course if your distributed systems work is in AI/ML production.

For ongoing operational reference, the cheatsheets that align with this course: Kubernetes, Kubernetes Security, SPIFFE/SPIRE, OPA/Rego, API Security, Runtime Security, Service Mesh, DevSecOps.

Real-World Use Cases

  • Stripe runs active-passive multi-region for payment processing with sub-minute failover.
  • Google Spanner is the canonical example of globally-consistent SQL with TrueTime-bounded uncertainty.
  • Netflix runs active-active across AWS regions for the streaming control plane (read-heavy, eventually consistent).
  • AWS DynamoDB Global Tables provide multi-region active-active with last-writer-wins; useful for read-heavy global content.
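
Last-writer-wins is simple to state and easy to misuse, so a small sketch helps. This is a generic LWW merge written for illustration, not DynamoDB's implementation; the `Versioned` record and the balance example are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Versioned:
    value: int
    timestamp: float
    region: str

def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the later timestamp; region name breaks ties."""
    return max(a, b, key=lambda v: (v.timestamp, v.region))

# Both regions start from a balance of 100 and write during a partition.
us = Versioned(value=70, timestamp=10.0, region="us-east-1")  # debit of 30
eu = Versioned(value=50, timestamp=10.5, region="eu-west-1")  # debit of 50

merged = lww_merge(us, eu)
print(merged.value)  # 50: the 30 debit from us-east-1 is silently discarded
```

Both regions converge on the same value, which is all LWW promises. The correct balance after both debits would be 20, so one write has been lost. That is acceptable for a profile field and unacceptable for money.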

Production Notes

  • Practice DR drills quarterly. Untested DR is theatre. Measure actual RTO; gap-fill before the real outage.
  • Capacity reviews monthly; full re-projection quarterly. Growth surprises do not have to be surprises.
  • Tag every cloud resource with cost-centre + service. Cost engineering needs attribution.
  • Buy reserved capacity for the steady-state baseline; on-demand for the burst; spot for batch.

Common Mistakes

  • Designing for “active-active multi-region” for write-heavy workloads without globally-consistent storage. Split brain on payments is catastrophic.
  • Confusing RPO and RTO. RPO is data loss tolerance; RTO is recovery time tolerance. Both need explicit SLOs.
  • Right-sizing once and forgetting. Workload behaviour drifts; right-sizing is a quarterly discipline.
  • No DR runbook for the dependency graph. Bringing services back in random order risks cascading retries on cold backends.
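
A dependency-ordered startup sequence can be derived mechanically rather than remembered. The sketch below uses Python's standard-library `graphlib` (Python 3.9+); the service names and the `DEPENDS_ON` graph are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each service maps to the services it depends on.
DEPENDS_ON = {
    "database": set(),
    "cache": set(),
    "auth": {"database"},
    "api": {"auth", "cache", "database"},
    "frontend": {"api"},
}

def startup_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return services ordered so every dependency starts before its dependants."""
    return list(TopologicalSorter(depends_on).static_order())

print(startup_order(DEPENDS_ON))
```

`TopologicalSorter` raises `graphlib.CycleError` if the graph contains a cycle, which is itself a useful finding to surface in a drill rather than during a real outage.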

Security Risks to Watch

  • Multi-region implies cross-region replication; the wire is a new attack surface. Always TLS, ideally mTLS via SPIFFE federation.
  • DR backups in cold storage are still customer data; encrypt at rest with KMS, audit access.
  • Failover automation that bypasses change control becomes the attacker's lever (“simulate failure to force failover”).
  • Active-active with shared global IAM means one region's compromise = global compromise. Region-scoped credentials are safer.

Design Tradeoffs

Active-Passive multi-region

Pros

  • Simple consistency story
  • Bounded cost
  • Tested failover path

Cons

  • Standby capacity is “wasted”
  • Failover is an operation

Active-Active sharded

Pros

  • Each region serves local traffic
  • Clear ownership boundary

Cons

  • Cross-shard ops are expensive
  • Complex routing

Active-Active replicated (full data in every region)

Pros

  • Read locality everywhere

Cons

  • Conflict resolution required
  • Dangerous for writes-with-consequence without consistent storage

Globally-consistent (Spanner / CockroachDB)

Pros

  • Strong consistency globally (external consistency in Spanner; serializable transactions in CockroachDB)
  • Writes anywhere

Cons

  • Premium cost
  • Cross-region commit latency

Production Alternatives

  • Active-Passive multi-region: Simplest, bounded RTO/RPO; standby capacity costs.
  • Active-Active sharded: Each region owns a shard; standard pattern for global SaaS.
  • Active-Active replicated: Same data in every region; fine for read-heavy content, but safe for writes-with-consequence only with globally-consistent storage.
  • Edge-first (Cloudflare Workers, Lambda@Edge): Compute at the edge; lowest latency for global users.
  • Multi-cloud (avoid vendor lock-in): Highest operational cost; pays back if vendor risk is real.

Think Like an Engineer

  • Architecture is the discipline of saying no. Every “yes” to a feature locks in trade-offs that constrain the next year of decisions.
  • For every architectural choice, write down the alternative. If you cannot articulate the alternative, you do not understand your own choice.
  • When defending an architecture, frame it as “I chose X because the trade-off was Y vs Z; here is which I optimised for and why”. That framing is the difference between “senior” and “staff”.

Production Story

A team operated active-active across two regions for “HA” on their payments platform. A 4-minute partition between regions caused both sides to accept conflicting writes (same user charged twice from two regions). Reconciliation took 3 weeks of manual work. The redesign moved to active-passive with explicit failover and tested RTO < 5 minutes. The engineering lesson: active-active for writes-with-consequence requires globally-consistent storage (Spanner/CockroachDB) or sharded ownership; never “the same database in two regions” with async replication.

Career Relevance

The capstone of this course aligns with the senior-to-staff engineering interview at most large companies: design a real production system, identify trade-offs, defend choices. Engineers who can produce a coherent, defensible architecture document and walk through it under questioning are the engineers who get trusted with the architecture role. The capstone exercise is your portfolio piece.

Key Terms

RPO
Recovery Point Objective; how much data you can afford to lose in a disaster.
RTO
Recovery Time Objective; how long recovery can take.
Active-Active
Multiple regions accept reads and writes simultaneously.
Active-Passive
One region is primary; others are standby replicas activated only on failover.
Capacity planning
Discipline of forecasting load and provisioning to meet it efficiently.

Hands-On Labs

  1. Lab 12.1 — Multi-Region Active-Active Demo

    Stand up a small active-active app across two regions; observe failover.

    120 minutes - Advanced

    • Deploy app + DB in two simulated regions (kind clusters)
    • Configure CRDT-style or LWW conflict resolution
    • Simulate partition; observe behaviour
    • Heal partition; verify convergence

    View lab files on GitHub

  2. Lab 12.2 — DR Drill

    Practice a full disaster recovery drill from backup.

    120 minutes - Advanced

    • Take a snapshot of a stateful service
    • Destroy the cluster
    • Restore from snapshot to a fresh cluster
    • Measure actual RTO; identify gaps

    View lab files on GitHub

  3. Lab 12.3 — Capstone Architecture Document

    Produce a complete production architecture for a distributed system of your choosing.

    4 hours - Advanced

    • Pick a domain (e-commerce, payments, analytics, social)
    • Document architecture per the 10-point list above
    • Defend each choice with an alternative + the trade-off you made
    • Submit to peer review

    View lab files on GitHub

Key Takeaways

  • Multi-region is hard; pick active-passive or active-active sharded for most workloads
  • RPO and RTO drive every DR decision; untested DR is theatre
  • Capacity planning is forecasting + bottleneck analysis + autoscaling discipline
  • Cost engineering is architectural at scale; right-size, spot, reserved capacity
  • The capstone is the test: can you defend every architectural choice with the trade-off?