Module 1: Foundations of Distributed Systems Slides

Slide walkthrough for Module 1 of Distributed Systems Engineering: Building Scalable, Reliable & Secure Systems: What a distributed system actually is,...

This slide page is the visual review companion for the full course module. Use it to recap the architecture, examples, exercises, production warnings, and takeaways after reading the lesson.

Slide Outline

  1. Foundations of Distributed Systems - What a distributed system actually is, why we build them, and the trade-offs that define every design decision after this point.
  2. Learning Objectives - 5 outcomes for this module
  3. Why This Module Matters - Every senior engineer who works on production systems eventually owns or designs a distributed component.
  4. Before vs After - The operational shift this module teaches
  5. Why Distribute? The Real Reasons - Lesson section from the full module
  6. The CAP Theorem - A Decision Tool, Not a Theorem - Lesson section from the full module
  7. Latency - The Tax You Pay - Lesson section from the full module
  8. Availability and the “Nines” - Lesson section from the full module
  9. Fault Tolerance - Designing for “When”, Not “If” - Lesson section from the full module
  10. Consistency Models - What You Promise - Lesson section from the full module
  11. How This Course Is Structured - Lesson section from the full module
  12. The CAP Triangle Visualised - Lesson section from the full module
  13. Real-World Use Cases - Stripe runs critical payment paths as a monolith with microservice satellites, choosing simplicity for the money path and distribution for the periphery. Shopify operates a Rails “majestic monolith” for the storefront with carved-out services for checkout and search; the architecture is a deliberate trade-off, not an accident.
  14. Common Mistakes to Avoid - 3 mistakes covered
  15. Production Notes - 3 practical notes
  16. Security Risks to Watch - 3 risks covered
  17. Hands-On Labs - 3 hands-on labs
  18. Key Takeaways - 5 points to remember

Learning Objectives

  • Define a distributed system from a production-engineering perspective
  • Understand why distributed systems replace monoliths and what it costs you
  • Internalise CAP and PACELC as decision frameworks, not academic theorems
  • Reason about latency, availability, fault tolerance, and consistency as a coupled system
  • Build the mental model that every later module depends on

Why This Module Matters

Every senior engineer who works on production systems eventually owns or designs a distributed component. The engineers who succeed are the ones who internalise these foundations early — CAP, latency math, availability math, partial-failure thinking — and use them as a decision framework. The engineers who skip the foundations end up reinventing distributed databases badly and debugging the same outage classes for years. This module is the lens you carry into every later module.

Production Notes

  • Track a critical-path availability dashboard (multiply each dependency's SLO) so the org sees the math, not the wishful thinking.
  • Every cross-service call gets a timeout. Default: do not let your services have unbounded patience. Specific timeouts are part of every service contract.
  • When you run the availability math, the answer always points at one or two services. That is your investment list, not a hypothetical.
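
The timeout note can be made concrete with a small sketch. It assumes an asyncio-based service; `call_with_budget`, `TIMEOUT_BUDGETS_S`, and the dependency names are hypothetical, and `slow_dependency` stands in for a real network call.

```python
import asyncio

# Hypothetical per-dependency budgets in seconds. In a real system these
# values would come from each service's published contract.
TIMEOUT_BUDGETS_S = {"pricing": 0.05, "inventory": 0.20}


class DependencyTimeout(Exception):
    """The dependency did not answer within its agreed budget."""


async def call_with_budget(name, make_call):
    """Await make_call(), but give up once the budget for `name` is spent.

    A dependency with no entry in TIMEOUT_BUDGETS_S raises KeyError, so a
    call without an explicit timeout cannot be made by accident.
    """
    budget = TIMEOUT_BUDGETS_S[name]
    try:
        return await asyncio.wait_for(make_call(), timeout=budget)
    except asyncio.TimeoutError as exc:
        raise DependencyTimeout(f"{name} exceeded {budget}s") from exc


async def slow_dependency(delay_s):
    # Stand-in for a network call; it sleeps instead of doing real I/O.
    await asyncio.sleep(delay_s)
    return "ok"


async def demo():
    # Answers well inside the 0.20 s inventory budget.
    fast = await call_with_budget("inventory", lambda: slow_dependency(0.01))
    # Takes far longer than the 0.05 s pricing budget, so it is cut off.
    try:
        await call_with_budget("pricing", lambda: slow_dependency(1.0))
        slow = "completed"
    except DependencyTimeout:
        slow = "timed out"
    return fast, slow


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Keeping the budgets in one table, and refusing calls to dependencies that are missing from it, is one way to make "no unbounded patience" the default rather than something each caller has to remember.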

Common Mistakes

  • Adopting microservices because “everyone else does” before measuring whether failure isolation, independent scaling, or team autonomy actually justify the operational tax.
  • Treating CAP as a textbook quiz question rather than a runtime decision: the question is “during a real partition, what should this service do?”
  • Assuming dependencies deliver their advertised availability when their measured availability is materially different.

Key Takeaways

  • A distributed system exists for failure isolation, independent scaling, and team autonomy, not just “scale”
  • CAP/PACELC frame the trade-offs you must make consciously; pretending otherwise leads to surprise outages
  • Latency is set by topology before it is set by code; bad architecture cannot be tuned
  • Effective availability is the product of dependency availabilities, so mind your critical path
  • Plan for partial failure as a normal operating mode, not an emergency

Hands-On Labs

  1. Lab 1.1 — Latency Simulation Across Service Boundaries

    Measure how cross-service network hops accumulate latency in a real microservice topology.

    45 minutes - Beginner

    • Spin up 5 small services (Go or Python) on docker-compose
    • Wire them in a chain: A → B → C → D → E
    • Add 5ms artificial latency per hop
    • Send 1000 requests through the chain and record p50/p95/p99
    • Compare to a single-monolith implementation
    • Plot the cumulative latency
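
    The arithmetic behind this lab can be previewed before any containers exist. The Python sketch below simulates the chain rather than running it. It assumes four network hops between the five services, and it adds an assumed exponential jitter with a 1 ms mean on top of the lab's 5 ms per hop. The function names and the jitter model are illustrative assumptions, not part of the lab files.

```python
import random
import statistics

HOP_DELAY_MS = 5.0     # artificial latency per hop, as in the lab
JITTER_MEAN_MS = 1.0   # assumed extra jitter per hop (not from the lab)
REQUESTS = 1000


def request_latency_ms(hops, rng):
    """Latency of one request that crosses `hops` network hops in sequence."""
    return sum(
        HOP_DELAY_MS + rng.expovariate(1.0 / JITTER_MEAN_MS)
        for _ in range(hops)
    )


def percentiles(samples):
    """Return (p50, p95, p99) from the 99 cut points of a 100-quantile split."""
    cuts = statistics.quantiles(samples, n=100)
    return cuts[49], cuts[94], cuts[98]


def simulate(hops, seed=42):
    """Simulate REQUESTS requests over `hops` hops with a fixed seed."""
    rng = random.Random(seed)
    samples = [request_latency_ms(hops, rng) for _ in range(REQUESTS)]
    return percentiles(samples)


if __name__ == "__main__":
    # The monolith is modelled as zero network hops: in-process calls only.
    for label, hops in [("monolith, 0 hops", 0), ("chain A..E, 4 hops", 4)]:
        p50, p95, p99 = simulate(hops)
        print(f"{label}: p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
```

    Because hops are crossed in sequence, the fixed delays add up and the jitter widens the tail, which is the effect the real measurement should reproduce.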

  2. Lab 1.2 — Failure Isolation Test

    Demonstrate failure isolation: kill one microservice and observe how the system degrades vs how a monolith fails.

    45 minutes - Beginner

    • Use the same 5-service chain from Lab 1.1
    • Add graceful degradation in service A: if D fails, return cached or partial response
    • Send traffic, kill service D mid-flight
    • Observe error rate, response codes, response shape
    • Repeat with the monolithic implementation

  3. Lab 1.3 — Availability Math

    Calculate end-to-end availability for a real architecture and identify the weakest link.

    30 minutes - Beginner

    • Document a real microservice architecture you operate (or a fictional one with 6 services)
    • Assign each service its measured or claimed availability
    • Compute end-to-end availability for the critical path
    • Identify the single change that would most improve overall SLO
    • Document the find-and-fix recommendation
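
    The calculation in this lab can be sketched in a few lines of Python. The six services and their availability figures below are fictional, and the sketch assumes the services sit in series on the critical path and fail independently, so that end-to-end availability is the product of the individual values. `best_single_fix` and its `target` parameter are hypothetical helpers for the "single change" step.

```python
from math import prod

# Fictional critical path; the numbers are illustrative, not measurements.
CRITICAL_PATH = {
    "gateway": 0.9999,
    "auth": 0.9995,
    "catalog": 0.999,
    "cart": 0.9995,
    "payments": 0.995,
    "notifications": 0.999,
}


def end_to_end(availabilities):
    """Serial dependencies: a request succeeds only if every service does."""
    return prod(availabilities.values())


def best_single_fix(availabilities, target=0.9999):
    """Raise one service at a time to `target` and return the biggest gain."""
    baseline = end_to_end(availabilities)
    gains = {}
    for name in availabilities:
        improved = dict(availabilities)
        improved[name] = target
        gains[name] = end_to_end(improved) - baseline
    best = max(gains, key=gains.get)
    return best, gains[best]


if __name__ == "__main__":
    total = end_to_end(CRITICAL_PATH)
    name, gain = best_single_fix(CRITICAL_PATH)
    print(f"end-to-end availability: {total:.4%}")
    print(f"best single fix: {name} (+{gain:.4%})")
```

    With these made-up numbers the path as a whole lands at roughly 99.2%, lower than any single service on it, and the weakest service dominates the shortfall. Swapping in measured values for claimed ones is the point of the exercise.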
