Multi-region architecture, disaster recovery, capacity planning, real production tradeoffs — and a capstone that integrates every module into one system.
-Design for cost — capacity reservations, autoscaling, regional placement
-Read production architecture diagrams (Kafka, Kubernetes, Netflix, Uber, Google) and identify the trade-offs
-Design a complete production system as the capstone and defend the choices
Before
-DR documented but never tested; first real failover is the test
-Single region; an AZ failure becomes a customer-impacting outage
-Cost grows linearly with users; no architectural levers pulled
-Capstone-level system design treated as senior interview hazing, not real skill
After
+DR drilled quarterly; actual RTO measured and improved
+Multi-region with explicit pattern (active-passive or sharded active-active)
+Cost engineering as ongoing discipline: right-sizing, spot, reservations
+Production architecture documented with defended trade-offs; portfolio-quality artefact
The capstone module ties every prior topic together. Production system design is the test of whether you can make CAP, replication, scaling, reliability, security, and observability cohere as a single architecture. Most engineers can name the components; few can defend the choices.
Multi-Region Architecture
Three patterns:
Active-Passive: writes go to one region; standby region is a hot replica. Simple consistency story; failover is a deliberate, observable operation. RPO bounded by replication lag.
Active-Active sharded: each region owns a shard of the keyspace. Reads and writes local to that region. Cross-region traffic only for cross-shard operations. The right pattern for most globally-distributed systems.
Active-Active replicated: same data in every region; conflict resolution required. Useful for read-heavy global content (CDN-cached pages, profile reads). Dangerous for writes-with-consequence (payments) without globally-consistent storage like Spanner.
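The sharded pattern above can be sketched as a routing function. This is a minimal illustration, not a production router: the region names, `owner_region`, and `route_write` are hypothetical, and plain modulo hashing is assumed for brevity (it reshuffles ownership when the region list changes, so real systems tend to use consistent hashing or an explicit home-region directory).

```python
import hashlib

# Hypothetical region names; a real system would load these from config.
REGIONS = ["us-east", "eu-west", "ap-south"]


def owner_region(key: str, regions=REGIONS) -> str:
    """Map a key to the single region that owns writes for it."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as a stable integer; modulo picks the owner.
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


def route_write(key: str, local_region: str) -> str:
    """Return 'local' if this region owns the key, else where to forward it."""
    owner = owner_region(key)
    return "local" if owner == local_region else f"forward:{owner}"


if __name__ == "__main__":
    for key in ("customer-1", "customer-2", "customer-3"):
        print(key, "->", owner_region(key), "|", route_write(key, "eu-west"))
```

Because every key has exactly one owning region, two regions never accept conflicting writes for the same key; the price is a cross-region hop whenever a request lands in a non-owning region.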
Disaster Recovery
Two metrics that drive every DR design:
RPO (Recovery Point Objective): how much data you can afford to lose. Driven by replication lag and backup cadence. RPO=0 requires synchronous cross-region replication, which you pay for in write latency and infrastructure cost. RPO=1 hour means hourly backups suffice.
RTO (Recovery Time Objective): how long the recovery can take. Driven by failover automation, DNS TTL, warm-standby vs cold-restoration. RTO of minutes requires hot standbys; RTO of hours allows for backup-restore.
The DR test discipline: untested DR is theatre. Run a quarterly DR drill: simulate a regional outage, fail over, measure the actual RTO, restore, and document the gaps.
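One way to make a drill measurable is to break the failover into phases and compare the sum against the RTO target. The sketch below assumes a hypothetical phase breakdown (`FailoverPhases`) and illustrative timings; none of the numbers come from a real drill. It also assumes the simplest RPO model: zero loss of acknowledged writes under synchronous replication, otherwise loss bounded by the replication lag at the moment of failure.

```python
from dataclasses import dataclass


@dataclass
class FailoverPhases:
    """Seconds spent in each failover phase (illustrative breakdown)."""
    detection: float   # time until the outage is noticed
    decision: float    # time until someone (or something) triggers failover
    promotion: float   # time to promote the standby to primary
    dns_ttl: float     # time for clients to pick up the new endpoint
    warmup: float      # time until the new primary serves at full capacity

    def total(self) -> float:
        return (self.detection + self.decision + self.promotion
                + self.dns_ttl + self.warmup)


def worst_case_rpo_seconds(replication_lag_s: float, synchronous: bool) -> float:
    """Simplified model: sync replication loses nothing, async loses the lag."""
    return 0.0 if synchronous else replication_lag_s


def meets_target(measured_s: float, target_s: float) -> bool:
    return measured_s <= target_s


if __name__ == "__main__":
    drill = FailoverPhases(detection=60, decision=120, promotion=90,
                           dns_ttl=300, warmup=180)
    target_rto_s = 5 * 60
    print("measured RTO (s):", drill.total())
    print("meets 5-minute target:", meets_target(drill.total(), target_rto_s))
    print("worst-case RPO (s), async:", worst_case_rpo_seconds(12.0, False))
```

In this made-up drill the DNS TTL alone consumes the whole 5-minute budget, which is the kind of gap a phase-by-phase measurement surfaces.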
Capacity Planning
Capacity engineering is forecasting load and provisioning to meet it without over-spending. The discipline:
Track per-service RPS, p99 latency, resource utilisation over time.
Project growth from product roadmap and historic curves.
Identify bottleneck per service (CPU, RAM, connection pool, disk IOPS, downstream RPC).
Reserve capacity for traffic peaks (Black Friday, marketing events).
Set autoscaling boundaries that avoid surprises (HPA max not too low, not infinite).
Review utilisation monthly; update the projection quarterly.
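The projection and autoscaling steps above reduce to a few lines of arithmetic. The sketch below assumes compound monthly growth, a single bottleneck expressed as RPS per replica, and a target utilisation that leaves headroom; the function names, the 60% utilisation, and the 1.5x burst factor are illustrative assumptions, not recommendations.

```python
import math


def projected_rps(current_peak_rps: float, monthly_growth: float, months: int) -> float:
    """Compound the current peak forward by a fixed monthly growth rate."""
    return current_peak_rps * (1 + monthly_growth) ** months


def replicas_needed(peak_rps: float, rps_per_replica: float,
                    target_utilisation: float = 0.6) -> int:
    """Replicas required so each one runs at or below the target utilisation."""
    return math.ceil(peak_rps / (rps_per_replica * target_utilisation))


def hpa_max_is_sufficient(required_replicas: int, hpa_max: int,
                          burst_factor: float = 1.5) -> bool:
    """Check that the autoscaler ceiling leaves room for an unplanned burst."""
    return hpa_max >= math.ceil(required_replicas * burst_factor)


if __name__ == "__main__":
    peak = projected_rps(current_peak_rps=2000, monthly_growth=0.05, months=6)
    needed = replicas_needed(peak, rps_per_replica=150)
    print(f"projected peak: {peak:.0f} RPS, replicas needed: {needed}")
    print("HPA max of 40 sufficient:", hpa_max_is_sufficient(needed, hpa_max=40))
```

The useful output is less the replica count than the comparison against the autoscaler ceiling: a max set below the projected need turns a forecastable peak into an outage.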
Cost Engineering
At scale, cloud spend becomes architectural. Three high-leverage levers:
Right-sizing: VPA recommendations or manual analysis. Most workloads request 2–3x what they use; cutting that is direct savings.
Spot / preemptible instances: 60–90% cheaper than on-demand. Use for batch, async, stateless web. Karpenter handles the eviction churn.
Reserved capacity / savings plans: 30–60% cheaper for a committed baseline. Buy enough to cover the steady state and use on-demand for the rest.
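To see how the three levers combine, the following sketch compares an all-on-demand bill with a blend of reserved baseline, on-demand burst, and spot batch. The hourly rate, the instance counts, and the 40% and 70% discounts are hypothetical values chosen from within the ranges quoted above; real pricing varies by provider, instance type, and commitment term.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly billing


def all_on_demand_cost(baseline_instances: int, burst_instance_hours: float,
                       batch_instance_hours: float, on_demand_hourly: float) -> float:
    """Monthly cost if every instance-hour is billed at the on-demand rate."""
    total_hours = (baseline_instances * HOURS_PER_MONTH
                   + burst_instance_hours + batch_instance_hours)
    return total_hours * on_demand_hourly


def blended_cost(baseline_instances: int, burst_instance_hours: float,
                 batch_instance_hours: float, on_demand_hourly: float,
                 reserved_discount: float, spot_discount: float) -> dict:
    """Reserved for the baseline, on-demand for burst, spot for batch."""
    reserved = (baseline_instances * HOURS_PER_MONTH
                * on_demand_hourly * (1 - reserved_discount))
    on_demand = burst_instance_hours * on_demand_hourly
    spot = batch_instance_hours * on_demand_hourly * (1 - spot_discount)
    return {"reserved": reserved, "on_demand": on_demand, "spot": spot,
            "total": reserved + on_demand + spot}


if __name__ == "__main__":
    workload = dict(baseline_instances=20, burst_instance_hours=3000,
                    batch_instance_hours=5000, on_demand_hourly=0.10)
    before = all_on_demand_cost(**workload)
    after = blended_cost(**workload, reserved_discount=0.4, spot_discount=0.7)
    print(f"all on-demand: {before:.2f}")
    print(f"blended:       {after['total']:.2f}")
    print(f"saving:        {100 * (1 - after['total'] / before):.1f}%")
```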
Production Architecture Case Studies
Read four real production architectures and identify how each one made the decisions covered in this course:
Kafka at LinkedIn: trillions of events per day; partitioning by key, ZooKeeper (now KRaft) for metadata, MirrorMaker for cross-region.
Cassandra at Netflix: leaderless replication, multi-region with LOCAL_QUORUM, custom backup tooling, the original Chaos Monkey.
Uber's ringpop: SWIM gossip + consistent hashing for service partitioning.
Spanner at Google: globally-consistent SQL via TrueTime; the gold standard for multi-region strong consistency.
Capstone Project
Design and document a complete production distributed system. Your capstone should include:
An architecture diagram showing every service, datastore, and external dependency.
The communication choice per boundary (sync HTTP, async Kafka, mTLS, etc.).
The data model and partitioning strategy per datastore.
The replication strategy and consistency guarantees.
The autoscaling and capacity plan.
The reliability patterns (circuit breakers, timeouts, retries, degradation).
The security architecture (workload identity, mTLS, authz).
The observability stack (traces, metrics, logs, SLOs).
The deployment architecture (CI/CD, regions, rollback strategy).
The DR plan with RPO/RTO targets.
Defending the choices is the test. For each component you must be able to explain: why this and not the alternative. The answer is rarely the same for two systems — that is the discipline of system design.
[Diagram: Disaster Recovery Topology]
[Diagram: Global Traffic Routing]
Self-Check Quiz
Your business says “RPO=0, RTO=5 minutes” for the database. What does this require? (Answer: synchronous cross-region replication (or globally-consistent storage like Spanner) and automated failover. Both are expensive. Understand the cost before committing.)
You design active-active across two regions for a payment system. Why is that risky? (Answer: split-brain on writes during partition. Active-active for payments needs sharded ownership (each shard active in one region) or globally-consistent storage.)
Your DR drill takes 3 hours instead of the planned 30 minutes. What were the gaps? (Answer: typically secrets restoration, DNS propagation, dependency-order startup, or a missing runbook for one component. Drills find these. Untested DR is theatre.)
Your monthly cloud bill grows 30% in a quarter. Two engineers spend a week analysing. What three levers usually pay back? (Answer: right-sizing requests, spot/preemptible for batch, reserved/savings plans for baseline. Each is 30-90% savings on the relevant bucket.)
Why is the capstone exercise more valuable than another module of content? (Answer: production system design is a synthesis skill that transfers only with practice. The capstone forces you to apply every prior module's trade-offs to a single coherent architecture.)
Where to Go Next — Future Advanced Courses
This course gives you the foundations and operational fluency. Three directions for deeper specialisation, each available free on CodersSecret:
Mastering SPIFFE & SPIRE — 13 modules going deep on workload identity. The right next course after Module 8 (Distributed Security & Zero Trust) of this course.
Cloud Native Security Engineering — 16 modules on Kubernetes-native security. The right next course after Modules 8 and 10 of this course.
Production RAG Systems Engineering — 16 modules on AI-infrastructure-specific distributed systems patterns. The right next course if your distributed systems work is in AI/ML production.
-Stripe runs active-passive multi-region for payment processing with sub-minute failover.
-Google Spanner is the canonical example of globally-consistent SQL with TrueTime-bounded uncertainty.
-Netflix runs active-active across AWS regions for the streaming control plane (read-heavy, eventually consistent).
-AWS DynamoDB Global Tables provide multi-region active-active with last-writer-wins; useful for read-heavy global content.
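Why last-writer-wins suits read-heavy content but not payments can be shown with a generic merge function. This is a sketch of the general LWW idea, not of how DynamoDB Global Tables is implemented; `Versioned` and `lww_merge` are hypothetical names and the timestamps are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: int        # e.g. an account balance
    timestamp: float  # wall-clock time of the write
    region: str       # region that accepted the write


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the later timestamp; break ties by region name."""
    return max(a, b, key=lambda v: (v.timestamp, v.region))


if __name__ == "__main__":
    starting_balance = 100
    # During a partition, both regions debit the same account independently.
    write_a = Versioned(starting_balance - 30, timestamp=10.0, region="us-east")
    write_b = Versioned(starting_balance - 50, timestamp=10.2, region="eu-west")

    merged = lww_merge(write_a, write_b)
    correct = starting_balance - 30 - 50
    print("merged balance: ", merged.value)
    print("correct balance:", correct)
```

The merge converges, but the earlier debit is silently discarded. For a cached profile page that is harmless; for a balance it is lost money, which is why writes with consequences need sharded ownership or globally-consistent storage.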
Production notes
Keep these close
!Practice DR drills quarterly. Untested DR is theatre. Measure actual RTO; gap-fill before the real outage.
!Capacity reviews monthly; full re-projection quarterly. Growth surprises do not have to be surprises.
!Tag every cloud resource with cost-centre + service. Cost engineering needs attribution.
!Buy reserved capacity for the steady-state baseline; on-demand for the burst; spot for batch.
Common mistakes
What usually breaks
!Designing for “active-active multi-region” for write-heavy workloads without globally-consistent storage. Split brain on payments is catastrophic.
!Confusing RPO and RTO. RPO is data loss tolerance; RTO is recovery time tolerance. Both need explicit SLOs.
!Right-sizing once and forgetting. Workload behaviour drifts; right-sizing is a quarterly discipline.
!No DR runbook for the dependency graph. Bringing services back in random order risks cascading retries on cold backends.
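A runbook for the dependency graph can be derived rather than hand-written. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) to compute startup waves, where every service in a wave depends only on services from earlier waves. The service names and the `DEPENDENCIES` graph are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each service maps to the services it depends on.
DEPENDENCIES = {
    "postgres": set(),
    "kafka": set(),
    "auth": {"postgres"},
    "orders": {"postgres", "kafka", "auth"},
    "gateway": {"auth", "orders"},
}


def startup_waves(deps: dict) -> list:
    """Group services into waves that can be started in order.

    Raises graphlib.CycleError if the dependency graph contains a cycle.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies already started.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(startup_waves(DEPENDENCIES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Starting one wave at a time, and waiting for health checks before moving on, keeps upstream services from retrying against backends that are still cold.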
Security risks
Threats to watch
!Multi-region implies cross-region replication; the wire is a new attack surface. Always TLS, ideally mTLS via SPIFFE federation.
!DR backups in cold storage are still customer data; encrypt at rest with KMS, audit access.
!Failover automation that bypasses change control becomes the attacker's lever (“simulate failure to force failover”).
!Active-active with shared global IAM means one region's compromise = global compromise. Region-scoped credentials are safer.
Tradeoffs
Design choices you should be able to defend
Active-Passive multi-region
Pros
+Simple consistency story
+Bounded cost
+Tested failover path
Cons
-Standby capacity is “wasted”
-Failover is an operation
Active-Active sharded
Pros
+Each region serves local traffic
+Clear ownership boundary
Cons
-Cross-shard ops are expensive
-Complex routing
Active-Active replicated (full data in every region)
Pros
+Read locality everywhere
Cons
-Conflict resolution required
-Dangerous for writes-with-consequence without consistent storage
Active-Active sharded
Each region owns a shard; standard pattern for global SaaS.
Active-Active replicated
Same data in every region; only safe with globally-consistent storage.
Edge-first (Cloudflare Workers, Lambda@Edge)
Compute at the edge; lowest latency for global users.
Multi-cloud (avoid vendor lock-in)
Highest operational cost; pays back if vendor risk is real.
Think like an engineer
Questions to answer before shipping
?Architecture is the discipline of saying no. Every “yes” to a feature locks in trade-offs that constrain the next year of decisions.
?For every architectural choice, write down the alternative. If you cannot articulate the alternative, you do not understand your own choice.
?When defending an architecture, frame it as “I chose X because the trade-off was Y vs Z; here is which I optimised for and why”. That framing is the difference between “senior” and “staff”.
Key terms
Vocabulary used in this module
RPO
Recovery Point Objective; how much data you can afford to lose in a disaster.
RTO
Recovery Time Objective; how long recovery can take.
Active-Active
Multiple regions accept reads and writes simultaneously.
Active-Passive
One region is primary; others are standby replicas activated only on failover.
Capacity planning
Discipline of forecasting load and provisioning to meet it efficiently.
Labs
Hands-on labs
1
120 minutes, Advanced
Lab 12.1 — Multi-Region Active-Active Demo
Stand up a small active-active app across two regions; observe failover.
-Deploy app + DB in two simulated regions (kind clusters)