Module 12: Production System Design & Capstone Slides
Slide walkthrough for Module 12 of Distributed Systems Engineering: Building Scalable, Reliable & Secure Systems: Multi-region architecture, disaster...
This slide page is the visual review companion for the full course module. Use it to recap the architecture, examples, exercises, production warnings, and takeaways after reading the lesson.
Slide Outline
- Production System Design & Capstone - Multi-region architecture, disaster recovery, capacity planning, real production tradeoffs — and a capstone that integrates every module into one system.
- Learning Objectives - 5 outcomes for this module
- Why This Module Matters - This is the module that proves you can do system design at the senior / staff level. Anyone can list components; the eng
- Before vs After - The operational shift this module teaches
- Multi-Region Architecture - Lesson section from the full module
- Disaster Recovery - Lesson section from the full module
- Capacity Planning - Lesson section from the full module
- Cost Engineering - Lesson section from the full module
- Production Architecture Case Studies - Lesson section from the full module
- Capstone Project - Lesson section from the full module
- Disaster Recovery Topology - Lesson section from the full module
- Global Traffic Routing - Lesson section from the full module
- Real-World Use Cases - Stripe runs active-passive multi-region for payment processing with sub-minute failover. Google Spanner is the canonical example of globally-consistent SQL with TrueTime-bounded uncertainty.
- Common Mistakes to Avoid - 4 mistakes covered
- Production Notes - 4 practical notes
- Security Risks to Watch - 4 risks covered
- Hands-On Labs - 3 hands-on labs
- Key Takeaways - 5 points to remember
Learning Objectives
- Design a multi-region architecture and reason about its failure modes
- Plan disaster recovery: RPO, RTO, runbooks, tested restoration
- Design for cost — capacity reservations, autoscaling, regional placement
- Read production architecture diagrams (Kafka, Kubernetes, Netflix, Uber, Google) and identify the trade-offs
- Design a complete production system as the capstone and defend the choices
Why This Module Matters
This is the module that proves you can do system design at the senior / staff level. Anyone can list components; the engineer who can defend the choices (explain why this database and not that one, this consistency model and not the next, this region pattern and not the alternative) is the engineer who gets trusted with the architecture role. The capstone is your portfolio.
Production Notes
- Practice DR drills quarterly. Untested DR is theatre. Measure actual RTO; gap-fill before the real outage.
- Capacity reviews monthly; full re-projection quarterly. Growth surprises do not have to be surprises.
- Tag every cloud resource with cost-centre + service. Cost engineering needs attribution.
- Buy reserved capacity for the steady-state baseline; on-demand for the burst; spot for batch.
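The reserved / on-demand / spot split in the last note can be sketched as simple arithmetic. This is a minimal illustration, not a pricing model: the `HourlyPrices` values, the `plan_cost` helper, and the demand series are all hypothetical, and reserving at the observed minimum is a simplifying assumption (real plans often reserve at a percentile of demand).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HourlyPrices:
    """Hypothetical per-instance-hour prices; substitute real provider rates."""
    reserved: float
    on_demand: float
    spot: float


def plan_cost(hourly_demand, batch_instance_hours, prices):
    """Cover the steady-state baseline with reserved capacity, the burst above
    it with on-demand, and interruptible batch work with spot."""
    baseline = min(hourly_demand)  # instances needed in every observed hour
    hours = len(hourly_demand)
    burst_instance_hours = sum(d - baseline for d in hourly_demand)
    return {
        "reserved_instances": baseline,
        "reserved_cost": baseline * hours * prices.reserved,
        "on_demand_cost": burst_instance_hours * prices.on_demand,
        "spot_cost": batch_instance_hours * prices.spot,
    }


# Illustrative numbers only: four hours of serving demand plus 8 batch instance-hours.
prices = HourlyPrices(reserved=0.06, on_demand=0.10, spot=0.03)
demand = [10, 10, 14, 20]
plan = plan_cost(demand, batch_instance_hours=8, prices=prices)
blended_total = plan["reserved_cost"] + plan["on_demand_cost"] + plan["spot_cost"]
all_on_demand = (sum(demand) + 8) * prices.on_demand
print(f"blended: {blended_total:.2f}  all on-demand: {all_on_demand:.2f}")
```

Comparing the blended total against the all-on-demand figure gives the attribution-ready number a cost review needs; the comparison only holds if the reserved baseline is actually used in every hour.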
Common Mistakes
- Designing “active-active multi-region” for write-heavy workloads without globally consistent storage. Split brain on payments is catastrophic.
- Confusing RPO and RTO. RPO is data loss tolerance; RTO is recovery time tolerance. Both need explicit SLOs.
- Right-sizing once and forgetting. Workload behaviour drifts; right-sizing is a quarterly discipline.
- No DR runbook for the dependency graph. Bringing services back in random order risks cascading retries on cold backends.
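One way to avoid the random-order restart described in the last mistake is to derive restore waves from the dependency graph. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+); the service names and the `restore_waves` helper are hypothetical, and a real runbook would also wait for health checks between waves.

```python
from graphlib import TopologicalSorter


def restore_waves(depends_on):
    """Group services into waves: each wave may start only after every
    dependency in earlier waves is up. Raises graphlib.CycleError on cycles."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)                 # unblocks services that depend on this wave
    return waves


# Hypothetical graph: each key maps to the services it depends on.
dependencies = {
    "web": {"api"},
    "api": {"db", "cache"},
}
waves = restore_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Restoring in waves means backends are warm before their callers come up, which limits the retry storms the bullet above warns about.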
Key Takeaways
- Multi-region is hard — pick active-passive or active-active sharded for most workloads
- RPO and RTO drive every DR decision; untested DR is theatre
- Capacity planning is forecasting + bottleneck analysis + autoscaling discipline
- Cost engineering is architectural at scale; right-size, spot, reserved capacity
- The capstone is the test: can you defend every architectural choice with the trade-off?
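The capacity-planning takeaway (forecasting plus bottleneck analysis) can be illustrated with a small projection. This is a sketch under stated assumptions: compound monthly growth, a 70% utilisation ceiling, and made-up resource figures. The function names `months_until_exhaustion` and `first_bottleneck` are hypothetical.

```python
def months_until_exhaustion(current_peak, capacity, monthly_growth, max_utilisation=0.7):
    """Months until projected peak demand reaches the usable share of capacity.
    Returns 0 if already there, None if demand is flat or shrinking."""
    usable = capacity * max_utilisation
    if current_peak >= usable:
        return 0
    if monthly_growth <= 0:
        return None
    months, demand = 0, current_peak
    while demand < usable:
        demand *= 1 + monthly_growth  # compound growth, one month at a time
        months += 1
    return months


def first_bottleneck(resources, monthly_growth):
    """resources maps name -> (current_peak, capacity). Returns the resource
    that runs out of headroom first, plus the runway for every resource."""
    runway = {
        name: months_until_exhaustion(peak, capacity, monthly_growth)
        for name, (peak, capacity) in resources.items()
    }
    finite = {name: m for name, m in runway.items() if m is not None}
    worst = min(finite, key=finite.get) if finite else None
    return worst, runway


# Illustrative figures only.
resources = {"api_cpu_cores": (1000, 4000), "db_iops": (6000, 12000)}
bottleneck, runway = first_bottleneck(resources, monthly_growth=0.10)
print(bottleneck, runway)
```

The resource with the shortest runway is the one to address first; re-running the projection at each review keeps the growth assumption honest.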
Hands-On Labs
Lab 12.1 — Multi-Region Active-Active Demo
Stand up a small active-active app across two regions; observe failover.
120 minutes - Advanced
- Deploy app + DB in two simulated regions (kind clusters)
- Configure CRDT-style or LWW conflict resolution
- Simulate partition; observe behaviour
- Heal partition; verify convergence
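For the conflict-resolution step, a last-writer-wins (LWW) register is about the smallest thing that converges after a partition heals. The sketch below is an illustration, not the lab's reference solution: `LWWRegister`, the region names, and the integer timestamps are all assumptions. Note that LWW silently discards the losing write, which is why it suits preferences or profile fields and not payments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    """A single value tagged with the write's timestamp and origin region."""
    value: str
    timestamp: int   # assumed comparable across regions (logical or synced clock)
    region: str      # deterministic tie-breaker when timestamps collide

    def merge(self, other):
        # Highest (timestamp, region, value) wins. The ordering is total, so
        # merge is commutative, associative, and idempotent.
        return max(self, other, key=lambda r: (r.timestamp, r.region, r.value))


# During the partition each region accepts a local write.
eu = LWWRegister("shipping=standard", timestamp=5, region="eu-west")
us = LWWRegister("shipping=express", timestamp=7, region="us-east")

# After the partition heals, the regions exchange state and merge.
eu_after = eu.merge(us)
us_after = us.merge(eu)
print(eu_after == us_after, eu_after.value)
```

Convergence here means both replicas end with the same value regardless of merge order; checking that equality is the observable result for the final lab step.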
Lab 12.2 — DR Drill
Practice a full disaster recovery drill from backup.
120 minutes - Advanced
- Take a snapshot of a stateful service
- Destroy the cluster
- Restore from snapshot to a fresh cluster
- Measure actual RTO; identify gaps
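The measurement step of the drill can be reduced to three timestamps. In this sketch, achieved RPO is taken as the time between the last good snapshot and the outage (the window of data that would be lost), and achieved RTO as the time between the outage and restored service. The `drill_report` helper, the targets, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def drill_report(last_snapshot, outage_start, service_restored, rpo_target, rto_target):
    """Compare what the drill achieved against the stated RPO/RTO targets."""
    actual_rpo = outage_start - last_snapshot       # data written in this window is lost
    actual_rto = service_restored - outage_start    # how long the service was down
    return {
        "actual_rpo": actual_rpo,
        "actual_rto": actual_rto,
        "rpo_met": actual_rpo <= rpo_target,
        "rto_met": actual_rto <= rto_target,
        "rto_gap": max(actual_rto - rto_target, timedelta(0)),
    }


# Illustrative drill: snapshot at 09:45, cluster destroyed at 10:00, restored at 10:50.
report = drill_report(
    last_snapshot=datetime(2024, 1, 1, 9, 45),
    outage_start=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 10, 50),
    rpo_target=timedelta(minutes=15),
    rto_target=timedelta(minutes=30),
)
for key, value in report.items():
    print(f"{key}: {value}")
```

A non-zero `rto_gap` is the number to carry into the gap analysis: it says how much restoration time has to be removed before the target is credible.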
Lab 12.3 — Capstone Architecture Document
Produce a complete production architecture for a distributed system of your choosing.
4 hours - Advanced
- Pick a domain (e-commerce, payments, analytics, social)
- Document the architecture following the 10-point list in the full module's Capstone Project section
- Defend each choice with an alternative + the trade-off you made
- Submit to peer review