Module 12 of 12

Production System Design & Capstone

Multi-region architecture, disaster recovery, capacity planning, real production tradeoffs - and a capstone that integrates every module into one system.

6 hours · 3 labs · Free


Start here

Learning objectives

Design a multi-region architecture and reason about its failure modes
Plan disaster recovery: RPO, RTO, runbooks, tested restoration
Design for cost - capacity reservations, autoscaling, regional placement
Read production architecture diagrams (Kafka, Kubernetes, Netflix, Uber, Google) and identify the trade-offs
Design a complete production system as the capstone and defend the choices

Before

DR documented but never tested; first real failover is the test
Single region; an AZ failure becomes a customer-impacting outage
Cost grows linearly with users; no architectural levers pulled
Capstone-level system design treated as senior interview hazing, not real skill

After

DR drilled quarterly; actual RTO measured and improved
Multi-region with explicit pattern (active-passive or sharded active-active)
Cost engineering as ongoing discipline: right-sizing, spot, reservations
Production architecture documented with defended trade-offs; portfolio-quality artefact

The capstone module ties every prior topic together. Production system design tests whether you can make CAP, replication, scaling, reliability, security, and observability cohere in a single architecture. Most engineers can name the components; few can defend the choices.

Multi-Region Architecture

Three patterns:

Active-Passive: writes go to one region; standby region is a hot replica. Simple consistency story; failover is a deliberate, observable operation. RPO bounded by replication lag.
Active-Active sharded: each region owns a shard of the keyspace. Reads and writes local to that region. Cross-region traffic only for cross-shard operations. The right pattern for most globally-distributed systems.
Active-Active replicated: same data in every region; conflict resolution required. Useful for read-heavy global content (CDN-cached pages, profile reads). Dangerous for writes-with-consequence (payments) without globally-consistent storage like Spanner.
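A minimal sketch of the sharded active-active pattern, assuming shard ownership is assigned by hashing the key. The region names and the home_region and route_write helpers are hypothetical. Real systems often pin shards by tenant or geography through a directory service instead, because a plain modulo reassigns most keys whenever a region is added or removed.

```python
import hashlib

# Hypothetical region names, for illustration only.
REGIONS = ["us-east", "eu-west", "ap-south"]


def home_region(key: str, regions: list[str] = REGIONS) -> str:
    """Map a key to the single region that owns its shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % len(regions)
    return regions[shard]


def route_write(key: str, local_region: str) -> str:
    """Return 'local' if this region owns the key, else where to forward it."""
    owner = home_region(key)
    return "local" if owner == local_region else f"forward:{owner}"
```

Because every key has exactly one owning region, two regions never accept conflicting writes for the same key; the price is a cross-region hop whenever a request lands in a region that does not own it.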

Disaster Recovery

Two metrics that drive every DR design:

RPO (Recovery Point Objective): how much data you can afford to lose. Driven by replication lag and backup cadence. RPO=0 requires synchronous cross-region replication, which costs both money and write latency because every commit waits on a cross-region round trip. An RPO of 1 hour means hourly backups suffice.
RTO (Recovery Time Objective): how long the recovery can take. Driven by failover automation, DNS TTL, warm-standby vs cold-restoration. RTO of minutes requires hot standbys; RTO of hours allows for backup-restore.
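The two metrics can be estimated from a design before any drill is run. The sketch below is a back-of-envelope model under simplifying assumptions: worst-case RPO is the replication lag when a replica exists and the backup interval otherwise, and worst-case RTO is detection time plus recovery time plus DNS TTL. The DrDesign type and its field names are hypothetical, and a measured drill always overrides the estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrDesign:
    has_replica: bool          # True for a hot/warm standby, False for backup-only
    replication_lag_s: float   # worst observed lag; 0 for synchronous replication
    backup_interval_s: float   # time between backups
    detection_s: float         # time to notice the outage and decide to fail over
    recovery_s: float          # replica promotion time, or full restore time
    dns_ttl_s: float           # time for clients to pick up the new endpoint


def worst_case_rpo_s(design: DrDesign) -> float:
    """Data written since the last replicated or backed-up point is lost."""
    if design.has_replica:
        return design.replication_lag_s
    return design.backup_interval_s


def worst_case_rto_s(design: DrDesign) -> float:
    """Assumes the three phases run one after another, with no overlap."""
    return design.detection_s + design.recovery_s + design.dns_ttl_s


def meets_targets(design: DrDesign, rpo_target_s: float, rto_target_s: float) -> bool:
    return (worst_case_rpo_s(design) <= rpo_target_s
            and worst_case_rto_s(design) <= rto_target_s)
```

Plugging in a backup-only design with hourly backups and a two-hour restore makes it obvious why an RTO of minutes cannot be met without a hot standby.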

The DR test discipline: untested DR is theatre. Run a quarterly DR drill: simulate a regional outage, fail over, measure actual RTO, restore, document gaps.

Capacity Planning

Capacity engineering is forecasting load and provisioning to meet it without over-spending. The discipline:

Track per-service RPS, p99 latency, resource utilisation over time.
Project growth from product roadmap and historic curves.
Identify bottleneck per service (CPU, RAM, connection pool, disk IOPS, downstream RPC).
Reserve capacity for traffic peaks (Black Friday, marketing events).
Set autoscaling boundaries that avoid surprises (HPA max high enough to absorb a peak, low enough to cap a runaway bill).
Review monthly; update quarterly.
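The projection and provisioning steps above reduce to a little arithmetic. This is a rough sketch assuming compound monthly growth and a single bottleneck expressed as requests per second per replica; the function names, the 60% utilisation target, and the minimum of two replicas are illustrative assumptions, not recommendations.

```python
import math


def projected_rps(current_rps: float, monthly_growth: float, months: int) -> float:
    """Compound growth: 0.10 means 10% more traffic each month."""
    return current_rps * (1 + monthly_growth) ** months


def replicas_needed(rps: float, rps_per_replica: float,
                    target_utilisation: float = 0.6) -> int:
    """Replicas required so each one runs at or below the target utilisation."""
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    return math.ceil(rps / (rps_per_replica * target_utilisation))


def hpa_bounds(steady_rps: float, peak_rps: float, rps_per_replica: float,
               target_utilisation: float = 0.6, min_floor: int = 2) -> tuple[int, int]:
    """Suggest (minReplicas, maxReplicas): steady state sets the floor, peak the cap."""
    low = max(min_floor, replicas_needed(steady_rps, rps_per_replica, target_utilisation))
    high = max(low, replicas_needed(peak_rps, rps_per_replica, target_utilisation))
    return low, high
```

The per-replica throughput should come from a load test against the actual bottleneck, since a number derived from CPU alone is wrong when the connection pool or a downstream RPC saturates first.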

Cost Engineering

At scale, cloud spend becomes architectural. Three high-leverage levers:

Right-sizing: VPA recommendations or manual analysis. Most workloads request 2–3x what they use; cutting that is direct savings.
Spot / preemptible instances: 60–90% cheaper than on-demand. Use for batch, async, stateless web. On Kubernetes, a node provisioner such as Karpenter can absorb the eviction churn.
Reserved capacity / savings plans: 30–60% cheaper for committed baseline. Buy enough to cover the steady-state, on-demand for the rest.
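To see how the levers combine, here is a small sketch that prices a fleet split into a reserved baseline, an on-demand burst, and spot batch capacity. The monthly_cost function and its default discounts are assumptions chosen from within the ranges quoted above; actual discounts depend on provider, instance family, and commitment term.

```python
def monthly_cost(baseline_units: float, burst_units: float, batch_units: float,
                 on_demand_price: float,
                 reserved_discount: float = 0.5,
                 spot_discount: float = 0.75) -> float:
    """Blended monthly cost for a fleet split across three purchase models.

    Units are whatever you buy in (instances, vCPU-months); on_demand_price
    is the undiscounted monthly price of one unit.
    """
    reserved = baseline_units * on_demand_price * (1 - reserved_discount)
    on_demand = burst_units * on_demand_price
    spot = batch_units * on_demand_price * (1 - spot_discount)
    return reserved + on_demand + spot


def all_on_demand_cost(baseline_units: float, burst_units: float,
                       batch_units: float, on_demand_price: float) -> float:
    """The same fleet with no commitments and no spot, for comparison."""
    return (baseline_units + burst_units + batch_units) * on_demand_price
```

Comparing the two numbers for your own fleet shows which bucket dominates the bill, which is usually the one to attack first.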

Production Architecture Case Studies

Read four real production architectures and identify how they made every decision in this course:

Kafka at LinkedIn: trillions of events per day; partitioning by key, ZooKeeper (now KRaft) for metadata, MirrorMaker for cross-region.
Cassandra at Netflix: leaderless replication, multi-region with LOCAL_QUORUM, custom backup tooling, the original Chaos Monkey.
Uber's ringpop: SWIM gossip + consistent hashing for service partitioning.
Spanner at Google: globally-consistent SQL via TrueTime; the gold standard for multi-region strong consistency.
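Consistent hashing, the partitioning half of the ringpop entry above, is small enough to sketch. This is a generic hash ring with virtual nodes, not ringpop's implementation; the HashRing class, the vnodes default, and the use of SHA-256 are assumptions, and membership (the part SWIM gossip handles) is simply passed in as a list.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring: each node is placed at many points on the ring."""

    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        # Every node contributes `vnodes` points so load spreads evenly.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

    def owner(self, key: str) -> str:
        """The owner is the first ring point clockwise from the key's hash."""
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that matters operationally is that removing a node only reassigns the keys that node owned; every other key keeps its owner, so a single failure does not trigger a full reshuffle.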

Capstone Project

Design and document a complete production distributed system. Your capstone should include:

An architecture diagram showing every service, datastore, and external dependency.
The communication choice per boundary (sync HTTP, async Kafka, mTLS, etc.).
The data model and partitioning strategy per datastore.
The replication strategy and consistency guarantees.
The autoscaling and capacity plan.
The reliability patterns (circuit breakers, timeouts, retries, degradation).
The security architecture (workload identity, mTLS, authz).
The observability stack (traces, metrics, logs, SLOs).
The deployment architecture (CI/CD, regions, rollback strategy).
The DR plan with RPO/RTO targets.

Defending the choices is the test. For each component you must be able to explain: why this and not the alternative. The answer is rarely the same for two systems - that is the discipline of system design.

Disaster Recovery Topology

Global Traffic Routing

Self-Check Quiz

Your business says “RPO=0, RTO=5 minutes” for the database. What does this require? (Answer: synchronous cross-region replication (or globally-consistent storage like Spanner) and automated failover. Both are expensive. Understand the cost before committing.)
You design active-active across two regions for a payment system. Why is that risky? (Answer: split-brain on writes during partition. Active-active for payments needs sharded ownership (each shard active in one region) or globally-consistent storage.)
Your DR drill takes 3 hours instead of the planned 30 minutes. What were the gaps? (Answer: typically - secrets restoration, DNS propagation, dependency-order startup, missing runbook for one component. Drills find these. Untested DR is theatre.)
Your monthly cloud bill grows 30% in a quarter. Two engineers spend a week analysing. What three levers usually pay back? (Answer: right-sizing requests, spot/preemptible for batch, reserved/savings plans for baseline. Each is 30-90% savings on the relevant bucket.)
Why is the capstone exercise more valuable than another module of content? (Answer: production system design is a synthesis skill that transfers only with practice. The capstone forces you to apply every prior module's trade-offs to a single coherent architecture.)

Where to Go Next - Future Advanced Courses

This course gives you the foundations and operational fluency. Three directions for deeper specialisation, each available free on CodersSecret:

Mastering SPIFFE & SPIRE - 13 modules going deep on workload identity. The right next course after Module 8 (Distributed Security & Zero Trust) of this course.
Cloud Native Security Engineering - 16 modules on Kubernetes-native security. The right next course after Modules 8 and 10 of this course.
Production RAG Systems Engineering - 16 modules on AI-infrastructure-specific distributed systems patterns. The right next course if your distributed systems work is in AI/ML production.

For ongoing operational reference, the cheatsheets that align with this course: Kubernetes, Kubernetes Security, SPIFFE/SPIRE, OPA/Rego, API Security, Runtime Security, Service Mesh, DevSecOps.

Real world

Where this shows up

Stripe runs active-passive multi-region for payment processing with sub-minute failover.
Google Spanner is the canonical example of globally-consistent SQL with TrueTime-bounded uncertainty.
Netflix runs active-active across AWS regions for the streaming control plane (read-heavy, eventually consistent).
AWS DynamoDB Global Tables provide multi-region active-active with last-writer-wins; useful for read-heavy global content.

Production notes

Keep these close

Practice DR drills quarterly. Untested DR is theatre. Measure actual RTO; gap-fill before the real outage.
Capacity reviews monthly; full re-projection quarterly. Growth surprises do not have to be surprises.
Tag every cloud resource with cost-centre + service. Cost engineering needs attribution.
Buy reserved capacity for the steady-state baseline; on-demand for the burst; spot for batch.

Common mistakes

What usually breaks

Designing “active-active multi-region” for write-heavy workloads without globally-consistent storage. Split brain on payments is catastrophic.
Confusing RPO and RTO. RPO is data loss tolerance; RTO is recovery time tolerance. Both need explicit SLOs.
Right-sizing once and forgetting. Workload behaviour drifts; right-sizing is a quarterly discipline.
No DR runbook for the dependency graph. Bringing services back in random order risks cascading retries on cold backends.
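The startup-order problem in the last item is a topological sort. The sketch below uses Python's standard graphlib module (3.9+) to turn a dependency graph into restart waves; the service names in DEPENDENCIES are hypothetical, and a real runbook would also gate each wave on health checks.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each service maps to the services it depends on.
DEPENDENCIES = {
    "database": set(),
    "cache": set(),
    "auth": {"database"},
    "api": {"auth", "cache", "database"},
    "frontend": {"api"},
}


def startup_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group services into waves; a wave starts only after earlier waves are up."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(startup_waves(DEPENDENCIES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Services within a wave can start in parallel, which keeps RTO down without letting the API hammer a database that is still replaying its log.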

Security risks

Threats to watch

Multi-region implies cross-region replication; the wire is a new attack surface. Always TLS, ideally mTLS via SPIFFE federation.
DR backups in cold storage are still customer data; encrypt at rest with KMS, audit access.
Failover automation that bypasses change control becomes the attacker's lever (“simulate failure to force failover”).
Active-active with shared global IAM means one region's compromise = global compromise. Region-scoped credentials are safer.

Tradeoffs

Design choices you should be able to defend

Active-Passive multi-region

Pros

Simple consistency story
Bounded cost
Tested failover path

Cons

Standby capacity is “wasted”
Failover is an operation

Active-Active sharded

Pros

Each region serves local traffic
Clear ownership boundary

Cons

Cross-shard ops are expensive
Complex routing

Active-Active replicated (full data in every region)

Pros

Read locality everywhere

Cons

Conflict resolution required
Dangerous for writes-with-consequence without consistent storage

Globally-consistent (Spanner / CockroachDB)

Pros

Linearizable globally
Writes anywhere

Cons

Premium cost
Cross-region commit latency

Alternatives

Other production approaches

Active-Passive multi-region

Simplest, bounded RTO/RPO; standby capacity costs.

Active-Active sharded

Each region owns a shard; standard pattern for global SaaS.

Active-Active replicated

Same data in every region; safe for consequential writes only with globally-consistent storage.

Edge-first (Cloudflare Workers, Lambda@Edge)

Compute at the edge; lowest latency for global users.

Multi-cloud (avoid vendor lock-in)

Highest operational cost; pays back if vendor risk is real.

Think like an engineer

Questions to answer before shipping

Architecture is the discipline of saying no. Every “yes” to a feature locks in trade-offs that constrain the next year of decisions.
For every architectural choice, write down the alternative. If you cannot articulate the alternative, you do not understand your own choice.
When defending an architecture, frame it as “I chose X because the trade-off was Y vs Z; here is which I optimised for and why”. That framing is the difference between “senior” and “staff”.

Key terms

Vocabulary used in this module

RPO

Recovery Point Objective; how much data you can afford to lose in a disaster.

RTO

Recovery Time Objective; how long recovery can take.

Active-Active

Multiple regions accept reads and writes simultaneously.

Active-Passive

One region is primary; others are standby replicas activated only on failover.

Capacity planning

Discipline of forecasting load and provisioning to meet it efficiently.

Labs

Hands-on labs

120 minutes · Advanced

Lab 12.1 - Multi-Region Active-Active Demo

Stand up a small active-active app across two regions; observe failover.

Deploy app + DB in two simulated regions (kind clusters)
Configure CRDT-style or LWW conflict resolution
Simulate partition; observe behaviour
Heal partition; verify convergence
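For orientation before the lab, here is a minimal sketch of the two conflict-resolution styles named in the steps above. The LwwRegister class and merge_gcounter function are illustrative assumptions, not the lab's code, and the integer timestamps stand in for whatever clock the lab uses; real wall clocks skew between regions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LwwRegister:
    """Last-writer-wins register: the highest (timestamp, region) pair survives."""
    value: str
    timestamp: int
    region: str  # tie-breaker so both sides pick the same winner

    def merge(self, other: "LwwRegister") -> "LwwRegister":
        return max(self, other, key=lambda r: (r.timestamp, r.region))


def merge_gcounter(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Grow-only counter CRDT: keep the per-region maximum; the total is the sum."""
    return {region: max(a.get(region, 0), b.get(region, 0))
            for region in a.keys() | b.keys()}


if __name__ == "__main__":
    # Two regions write during a partition, then exchange state when it heals.
    us = LwwRegister("address: 1 Main St", timestamp=100, region="us")
    eu = LwwRegister("address: 2 High St", timestamp=101, region="eu")
    print(us.merge(eu) == eu.merge(us))  # both sides converge on the same value
```

Both merges converge regardless of delivery order, but they differ in what survives: the G-Counter keeps every increment, while LWW silently discards the losing write, which is why it suits profile fields and not payments.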


120 minutes · Advanced

Lab 12.2 - DR Drill

Practice a full disaster recovery drill from backup.

Take a snapshot of a stateful service
Destroy the cluster
Restore from snapshot to a fresh cluster
Measure actual RTO; identify gaps


4 hours · Advanced

Lab 12.3 - Capstone Architecture Document

Produce a complete production architecture for a distributed system of your choosing.

Pick a domain (e-commerce, payments, analytics, social)
Document architecture per the 10-point list above
Defend each choice with an alternative + the trade-off you made
Submit to peer review


Recap

Key takeaways

Multi-region is hard - pick active-passive or active-active sharded for most workloads
RPO and RTO drive every DR decision; untested DR is theatre
Capacity planning is forecasting + bottleneck analysis + autoscaling discipline
Cost engineering is architectural at scale; right-size, spot, reserved capacity
The capstone is the test: can you defend every architectural choice with the trade-off?
