Module 1: Foundations of Distributed Systems
What a distributed system actually is, why we build them, and the trade-offs that define every design decision after this point.
3 hours. 3 hands-on labs. Free course module.
Learning Objectives
- Define a distributed system from a production-engineering perspective
- Understand why distributed systems replace monoliths and what it costs you
- Internalise CAP and PACELC as decision frameworks, not academic theorems
- Reason about latency, availability, fault tolerance, and consistency as a coupled system
- Build the mental model that every later module depends on
Why This Matters
Every senior engineer who works on production systems eventually owns or designs a distributed component. The engineers who succeed are the ones who internalise these foundations early — CAP, latency math, availability math, partial-failure thinking — and use them as a decision framework. The engineers who skip the foundations end up reinventing distributed databases badly and debugging the same outage classes for years. This module is the lens you carry into every later module.
Lesson Content
A distributed system is not multiple servers. A distributed system is what you get when failure of one component should not equal failure of the whole, when independent teams need to ship without coordinating every release, and when one machine is no longer enough to handle the load. Everything else — the consensus protocols, the service meshes, the observability pipelines — is mechanical detail that exists because we made the foundational choice to spread state and computation across many machines.
This module sets the mental model that every later module depends on. By the end you should be able to read a system architecture and name the trade-offs the designer made, predict the failure modes from the topology alone, and decide for any given service whether distribution is the right call or premature complexity.
Why Distribute? The Real Reasons
The standard answer is “scale”. The honest answer is more nuanced. Real production teams move from monolith to distributed for one or more of:
- Failure isolation — if the recommendation service crashes, the checkout flow should still work. A monolith dies as one process; distributed services degrade independently.
- Independent deploys — a 200-engineer org cannot rally around a single deploy train. Microservices let teams ship without lockstep coordination.
- Independent scaling — the search service may need 10x compute while the user-profile service needs 1x. A monolith forces them to scale together.
- Geographic distribution — users in Singapore expect low latency from Singapore. A single-region monolith cannot serve global traffic well.
- Heterogeneous storage — one service needs Postgres, another needs Redis, a third needs S3. Distribution lets each pick its store.
The cost ledger is real too. Every distributed boundary introduces latency, partial failure, network unreliability, debugging complexity, deployment coordination, and observability work that did not exist in the monolith. Distribute when one of the reasons above outweighs the operational tax — and not before.
The CAP Theorem — A Decision Tool, Not a Theorem
Eric Brewer's CAP theorem (formalised by Gilbert & Lynch in 2002) says: in a system with replication, you can have at most two of Consistency (every read sees the latest write), Availability (every request gets a non-error response), and Partition tolerance (the system continues operating across network partitions).
Network partitions are inevitable in real production environments — cables get cut, NICs fail, packet loss spikes during deploys. So you do not get to opt out of P. The actual question CAP forces is: during a partition, would you rather refuse writes (preserve consistency) or accept potentially stale data (preserve availability)?
- CP systems (etcd, ZooKeeper, Spanner): refuse writes during a partition rather than diverge. Used for control-plane state, leader election, configuration.
- AP systems (Cassandra, DynamoDB default, Riak): keep accepting reads/writes; reconcile divergent replicas later. Used for high-availability data planes.
The PACELC extension (Daniel Abadi, 2010) sharpens the picture: even when there is no Partition, you trade off between Latency and Consistency. Spanner is CP/EC (strict consistency at the cost of cross-region latency). DynamoDB is AP/EL by default (low latency at the cost of eventual consistency). The honest decision framework is PACELC, not just CAP.
Latency — The Tax You Pay
The numbers every distributed-systems engineer should know:
- L1 cache reference: ~0.5ns
- Main memory reference: ~100ns (200x slower than L1)
- SSD random read: ~150µs
- Network round-trip same-DC: ~0.5ms
- Network round-trip same-region: ~1–5ms
- Network round-trip cross-continent: ~80–200ms
Every microservice boundary you cross is at least 0.5ms in the same DC. Every cross-region call is 100ms. A user-facing request that passes through 8 services, hits a cross-region database, and waits on a cache miss can easily reach 500ms even if every service is healthy. The architecture determines the latency floor; you cannot tune your way out of bad topology.
Availability and the “Nines”
Availability is typically expressed as a percentage of uptime over a window. The famous “nines” ladder:
- 99% (two nines) ⇒ 3.65 days of downtime per year
- 99.9% (three nines) ⇒ 8.76 hours per year
- 99.99% (four nines) ⇒ ~52 minutes per year
- 99.999% (five nines) ⇒ ~5.26 minutes per year
Two practical realities. First, claimed availability rarely matches measured availability — cloud providers exclude maintenance windows, regional issues, and certain failure modes. Second, the dependency math is brutal: a service that depends on five 99.9% services has availability of 0.999^5 = 99.5%. Independent dependencies multiply, and your effective SLO is bounded by your weakest critical path.
Fault Tolerance — Designing for “When”, Not “If”
Fault tolerance is the property that the system continues to operate (perhaps at reduced capacity) even when some components fail. The standard tools:
- Redundancy — multiple replicas behind a load balancer; if one dies, others take over.
- Timeouts — do not wait forever for a dead dependency. Always set a timeout and have a fallback.
- Retries with exponential backoff and jitter — a failed call is retried, but with increasing delay (and randomness) so you do not hammer a recovering service.
- Circuit breakers — after N consecutive failures, stop calling the failing dependency for a window so it can recover.
- Bulkheads — isolate workloads so noisy neighbours cannot starve critical paths.
- Graceful degradation — when a non-critical dependency fails, return reduced functionality rather than full error.
Modules 6 and 7 cover these patterns in depth. For now, the key idea: plan for partial failure as a normal mode of operation, not an emergency.
Consistency Models — What You Promise
From strongest to weakest:
- Linearizable — every operation appears to happen at a single instant; reads see the latest write globally. Spanner, etcd Raft.
- Sequential consistency — all clients see the same order of operations, not necessarily wall-clock order. Single-leader DBs.
- Causal consistency — causally related operations are seen in causal order; concurrent operations may reorder. Common in collaborative editing.
- Read-your-writes — a client sees its own writes; other clients may lag.
- Eventual consistency — replicas converge if writes stop. The weakest useful guarantee. DynamoDB default.
The right consistency depends on the operation: a balance check needs strong consistency; a “number of likes” can tolerate eventual. Many systems offer tunable consistency at query time (Cassandra CL=QUORUM vs CL=ONE; MongoDB readConcern; DynamoDB ConsistentRead).
How This Course Is Structured
The next 11 modules walk you through every layer of a real distributed system:
- Modules 2–3: how services talk (networking, gRPC, events, Kafka).
- Modules 4–5: how data is split, replicated, and agreed upon (replication, sharding, consensus).
- Modules 6–7: how systems handle scale and failure (autoscaling, circuit breakers, chaos).
- Module 8: the security primitives that hold modern distributed systems together (Zero Trust, mTLS, SPIFFE).
- Module 9: how you observe and debug distributed systems (tracing, metrics, logs).
- Module 10: how Kubernetes changes everything.
- Module 11: real failure scenarios you will see in production.
- Module 12: how to design end-to-end production systems.
For deeper foundational reading, the Distributed Systems Algorithms guide goes into Raft/Paxos, quorum math, vector clocks, and CRDTs at the algorithm level. For hands-on practice, the Kubernetes Security Simulator exercises the operational decisions that matter from Module 8 onwards. Reach for the Kubernetes cheatsheet when you need a fast operational reference.
The CAP Triangle Visualised
Latency Propagation Across Service Boundaries
Self-Check Quiz
- You operate a payment system with 5 microservices on the critical path, each at 99.95% availability. What is your effective availability? (Answer: 0.9995^5 ≈ 99.75%, or ~22 hours of downtime per year. The math always points at one or two services as your investment.)
- During a network partition, your CP database refuses writes. The product team asks if you can "just keep accepting writes and reconcile later." How do you frame the trade-off? (Answer: that is exactly the AP choice; it requires designed conflict resolution — LWW, CRDTs, or human reconciliation. CP and AP are not interchangeable mid-flight.)
- A user complains that the recommendation panel sometimes shows stale recommendations after they update preferences. What are your three options? (Answer: linearizable reads from the leader; read-your-writes via session affinity; cache invalidation on update with a short TTL fallback.)
- Why is “just add more servers” the wrong first move when a service is slow? (Answer: scaling pushes the bottleneck downstream; if the database connection pool is exhausted, more replicas just exhaust it faster. Identify the bottleneck before scaling.)
Real-World Use Cases
- Stripe runs critical payment paths as a monolith with microservice satellites — chose simplicity for the money path, distribution for the periphery.
- Shopify operates a Rails “majestic monolith” for the storefront with carved-out services for checkout and search; the architecture is a deliberate trade-off, not the result of an accident.
- Amazon's famous “two-pizza team” rule was as much about deployment isolation (each team owns its services end-to-end) as about scaling.
- Segment publicly migrated FROM microservices BACK to a monolith for parts of their stack when the operational complexity outweighed the benefits.
Production Notes
- Track a critical-path availability dashboard (multiply each dependency's SLO) so the org sees the math, not the wishful thinking.
- Every cross-service call gets a timeout. Default: <em>do not let your services have unbounded patience</em>. Specific timeouts are part of every service contract.
- When you run the availability math, the answer always points at one or two services. That is your investment list, not a hypothetical.
Common Mistakes
- Adopting microservices because “everyone else does” before measuring whether failure isolation, independent scaling, or team autonomy actually justify the operational tax.
- Treating CAP as a textbook quiz question rather than a runtime decision — the question is “during a real partition, what should this service do?”
- Assuming dependencies have advertised availability when measured availability is materially different.
Security Risks to Watch
- Every distributed boundary is a new attack surface. Service-to-service calls without authn become trust-by-network-position, the model that Zero Trust replaces.
- Distributed deployments multiply credential management cost. Long-lived shared secrets in env vars across services become impossible to rotate.
- Failure-isolation only works if security is also isolated. A compromised auth service in a 5-service architecture must not give the attacker the keys to all five.
Design Tradeoffs
Distributed (microservices)
Pros
- Failure isolation
- Independent deploys
- Independent scaling per service
- Team autonomy at scale
Cons
- Network latency on every boundary
- Operational complexity (observability, deployment, security)
- Distributed-systems failure modes (split brain, partial failure)
- Higher cloud cost
Monolith
Pros
- In-process calls (~ns latency)
- Simple operational model
- Atomic transactions across the entire app
- Lower compute cost
Cons
- Single failure domain
- Lockstep deploys
- Vertical scaling only
- Coordination tax across teams
Modular monolith
Pros
- Most monolith benefits + clean module boundaries
- Refactor-able into microservices later
- Lowest operational cost for early-stage products
Cons
- Module boundaries enforced by discipline, not infrastructure
- Still a single deploy unit
Production Alternatives
- Modular monolith: Single deploy unit with strict module boundaries. Best for early-stage products before team scale forces distribution.
- Microservices on Kubernetes: Industry default once 50+ engineers; pays the operational tax for team-autonomy benefits.
- Service-Oriented Architecture (SOA): Older sibling of microservices; coarser-grained services. Still relevant when integrating heterogeneous legacy systems.
- Serverless / FaaS: Distribution without operating servers; trade-offs around cold-start latency, vendor lock-in, and observability.
Think Like an Engineer
- Before adopting microservices, calculate the latency floor your topology imposes. If your SLO is p99 < 200ms and your chain has 8 hops × 5ms each, you are out of budget before any work happens.
- Run the availability math on your critical path quarterly. If your effective availability is 99.5% and you committed 99.9%, the math points at one or two services as the investment list.
- Treat every cross-service call as a contract: timeout, retry policy, error semantics, and observability are all part of the contract. Without them, the network has won.
Production Story
A growing fintech moved from monolith to microservices on the assumption it would “scale better”. Six months later they were paying 3x the cloud bill for the same throughput, debugging a 7-service request chain over Slack at 3am, and shipping slower because every change required cross-team coordination on shared infra. The retro found that two services genuinely needed independent scaling (search, recommendations); the other five were team-org artifacts. They consolidated back to a 3-service core with two specialised satellites and recovered both cost and velocity. The lesson: distribution is a tax you pay for benefits; if the benefits are not real, the tax is just a tax.
Career Relevance
Senior and staff engineers are evaluated on system-design judgment as much as code. The engineers who can explain WHY their architecture is distributed (and what they get for the tax) get trusted with bigger architectural calls. The engineers who default to microservices because “that's what production looks like” tend to ship slowly and burn budget. This module is the lens those judgments depend on.
Key Terms
- CAP Theorem
- In a distributed system, you can have at most two of Consistency, Availability, and Partition tolerance simultaneously.
- PACELC
- Extension of CAP: even without partition, trade off Latency vs Consistency.
- Availability
- Percentage of time the system serves successful responses; often expressed as nines (99.9%, 99.99%, ...).
- Fault tolerance
- The system continues to operate in some form when components fail.
- Linearizability
- The strongest consistency model: every operation appears to happen at a single instant.
Hands-On Labs
-
Lab 1.1 — Latency Simulation Across Service Boundaries
Measure how cross-service network hops accumulate latency in a real microservice topology.
45 minutes - Beginner
- Spin up 5 small services (Go or Python) on docker-compose
- Wire them in a chain: A → B → C → D → E
- Add 5ms artificial latency per hop
- Send 1000 requests through the chain and record p50/p95/p99
- Compare to a single-monolith implementation
- Plot the cumulative latency
-
Lab 1.2 — Failure Isolation Test
Demonstrate failure isolation: kill one microservice and observe how the system degrades vs how a monolith fails.
45 minutes - Beginner
- Use the same 5-service chain from Lab 1.1
- Add graceful degradation in service A: if D fails, return cached or partial response
- Send traffic, kill service D mid-flight
- Observe error rate, response codes, response shape
- Repeat with the monolithic implementation
-
Lab 1.3 — Availability Math
Calculate end-to-end availability for a real architecture and identify the weakest link.
30 minutes - Beginner
- Document a real microservice architecture you operate (or a fictional one with 6 services)
- Assign each service its measured or claimed availability
- Compute end-to-end availability for the critical path
- Identify the single change that would most improve overall SLO
- Document the find-and-fix recommendation
Key Takeaways
- A distributed system exists for failure isolation, independent scaling, and team autonomy — not just “scale”
- CAP/PACELC frame the trade-offs you must make consciously; pretending otherwise leads to surprise outages
- Latency is set by topology before it is set by code — bad architecture cannot be tuned
- Effective availability is the product of dependency availabilities — mind your critical path
- Plan for partial failure as a normal operating mode, not an emergency