Before
- Microservices because “everyone else does it”
- No timeout discipline; retries everywhere with no budget
- No availability math on the critical path
- Distributed system that performs worse than the monolith it replaced
Module 1 of 12
What a distributed system actually is, why we build them, and the trade-offs that define every design decision after this point.
Start here
Before
After
A distributed system is not multiple servers. A distributed system is what you get when failure of one component should not equal failure of the whole, when independent teams need to ship without coordinating every release, and when one machine is no longer enough to handle the load. Everything else — the consensus protocols, the service meshes, the observability pipelines — is mechanical detail that exists because we made the foundational choice to spread state and computation across many machines.
This module sets the mental model that every later module depends on. By the end you should be able to read a system architecture and name the trade-offs the designer made, predict the failure modes from the topology alone, and decide for any given service whether distribution is the right call or premature complexity.
The standard answer is “scale”. The honest answer is more nuanced. Real production teams move from monolith to distributed for one or more of:
The cost ledger is real too. Every distributed boundary introduces latency, partial failure, network unreliability, debugging complexity, deployment coordination, and observability work that did not exist in the monolith. Distribute when one of the reasons above outweighs the operational tax — and not before.
Eric Brewer's CAP theorem (formalised by Gilbert & Lynch in 2002) says: in a system with replication, you can have at most two of Consistency (every read sees the latest write), Availability (every request gets a non-error response), and Partition tolerance (the system continues operating across network partitions).
Network partitions are inevitable in real production environments — cables get cut, NICs fail, packet loss spikes during deploys. So you do not get to opt out of P. The actual question CAP forces is: during a partition, would you rather refuse writes (preserve consistency) or accept potentially stale data (preserve availability)?
The PACELC extension (Daniel Abadi, 2010) sharpens the picture: even when there is no Partition, you trade off between Latency and Consistency. Spanner is CP/EC (strict consistency at the cost of cross-region latency). DynamoDB is AP/EL by default (low latency at the cost of eventual consistency). The honest decision framework is PACELC, not just CAP.
The numbers every distributed-systems engineer should know:
Every microservice boundary you cross is at least 0.5ms in the same DC. Every cross-region call is 100ms. A user-facing request that passes through 8 services, hits a cross-region database, and waits on a cache miss can easily reach 500ms even if every service is healthy. The architecture determines the latency floor; you cannot tune your way out of bad topology.
Availability is typically expressed as a percentage of uptime over a window. The famous “nines” ladder:
Two practical realities. First, claimed availability rarely matches measured availability — cloud providers exclude maintenance windows, regional issues, and certain failure modes. Second, the dependency math is brutal: a service that depends on five 99.9% services has availability of 0.999^5 = 99.5%. Independent dependencies multiply, and your effective SLO is bounded by your weakest critical path.
Fault tolerance is the property that the system continues to operate (perhaps at reduced capacity) even when some components fail. The standard tools:
Modules 6 and 7 cover these patterns in depth. For now, the key idea: plan for partial failure as a normal mode of operation, not an emergency.
From strongest to weakest:
The right consistency depends on the operation: a balance check needs strong consistency; a “number of likes” can tolerate eventual. Many systems offer tunable consistency at query time (Cassandra CL=QUORUM vs CL=ONE; MongoDB readConcern; DynamoDB ConsistentRead).
The next 11 modules walk you through every layer of a real distributed system:
For deeper foundational reading, the Distributed Systems Algorithms guide goes into Raft/Paxos, quorum math, vector clocks, and CRDTs at the algorithm level. For hands-on practice, the Kubernetes Security Simulator exercises the operational decisions that matter from Module 8 onwards. Reach for the Kubernetes cheatsheet when you need a fast operational reference.
Real world
Production notes
Common mistakes
Security risks
Tradeoffs
Pros
Cons
Pros
Cons
Pros
Cons
Alternatives
Single deploy unit with strict module boundaries. Best for early-stage products before team scale forces distribution.
Industry default once 50+ engineers; pays the operational tax for team-autonomy benefits.
Older sibling of microservices; coarser-grained services. Still relevant when integrating heterogeneous legacy systems.
Distribution without operating servers; trade-offs around cold-start latency, vendor lock-in, and observability.
Think like an engineer
Key terms
In a distributed system, you can have at most two of Consistency, Availability, and Partition tolerance simultaneously.
Extension of CAP: even without partition, trade off Latency vs Consistency.
Percentage of time the system serves successful responses; often expressed as nines (99.9%, 99.99%, ...).
The system continues to operate in some form when components fail.
The strongest consistency model: every operation appears to happen at a single instant.
Labs
Measure how cross-service network hops accumulate latency in a real microservice topology.
Demonstrate failure isolation: kill one microservice and observe how the system degrades vs how a monolith fails.
Calculate end-to-end availability for a real architecture and identify the weakest link.
Recap
Related resources