Module 1 of 12

Foundations of Distributed Systems

What a distributed system actually is, why we build them, and the trade-offs that define every design decision after this point.

3 hours3 labsFree

Watch as Slides Course overview Lab code

Start here

Learning objectives

Define a distributed system from a production-engineering perspective
Understand why distributed systems replace monoliths and what it costs you
Internalise CAP and PACELC as decision frameworks, not academic theorems
Reason about latency, availability, fault tolerance, and consistency as a coupled system
Build the mental model that every later module depends on

Before

Microservices because “everyone else does it”
No timeout discipline; retries everywhere with no budget
No availability math on the critical path
Distributed system that performs worse than the monolith it replaced

After

Distribution as a deliberate choice with concrete benefits identified upfront
Every cross-service call has timeout + retry policy + error semantics documented
Quarterly availability math review; investment list from the math
CAP/PACELC-aware design; conscious choice of consistency vs availability per service

A distributed system is not multiple servers. A distributed system is what you get when failure of one component should not equal failure of the whole, when independent teams need to ship without coordinating every release, and when one machine is no longer enough to handle the load. Everything else - the consensus protocols, the service meshes, the observability pipelines - is mechanical detail that exists because we made the foundational choice to spread state and computation across many machines.

This module sets the mental model that every later module depends on. By the end you should be able to read a system architecture and name the trade-offs the designer made, predict the failure modes from the topology alone, and decide for any given service whether distribution is the right call or premature complexity.

Why Distribute? The Real Reasons

The standard answer is “scale”. The honest answer is more nuanced. Real production teams move from monolith to distributed for one or more of:

Failure isolation - if the recommendation service crashes, the checkout flow should still work. A monolith dies as one process; distributed services degrade independently.
Independent deploys - a 200-engineer org cannot rally around a single deploy train. Microservices let teams ship without lockstep coordination.
Independent scaling - the search service may need 10x compute while the user-profile service needs 1x. A monolith forces them to scale together.
Geographic distribution - users in Singapore expect low latency from Singapore. A single-region monolith cannot serve global traffic well.
Heterogeneous storage - one service needs Postgres, another needs Redis, a third needs S3. Distribution lets each pick its store.

The cost ledger is real too. Every distributed boundary introduces latency, partial failure, network unreliability, debugging complexity, deployment coordination, and observability work that did not exist in the monolith. Distribute when one of the reasons above outweighs the operational tax - and not before.

The CAP Theorem - A Decision Tool, Not a Theorem

Eric Brewer's CAP theorem (formalised by Gilbert & Lynch in 2002) says: in a system with replication, you can have at most two of Consistency (every read sees the latest write), Availability (every request gets a non-error response), and Partition tolerance (the system continues operating across network partitions).

Network partitions are inevitable in real production environments - cables get cut, NICs fail, packet loss spikes during deploys. So you do not get to opt out of P. The actual question CAP forces is: during a partition, would you rather refuse writes (preserve consistency) or accept potentially stale data (preserve availability)?

CP systems (etcd, ZooKeeper, Spanner): refuse writes during a partition rather than diverge. Used for control-plane state, leader election, configuration.
AP systems (Cassandra, DynamoDB default, Riak): keep accepting reads/writes; reconcile divergent replicas later. Used for high-availability data planes.

The PACELC extension (Daniel Abadi, 2010) sharpens the picture: even when there is no Partition, you trade off between Latency and Consistency. Spanner is CP/EC (strict consistency at the cost of cross-region latency). DynamoDB is AP/EL by default (low latency at the cost of eventual consistency). The honest decision framework is PACELC, not just CAP.

Latency - The Tax You Pay

The numbers every distributed-systems engineer should know:

L1 cache reference: ~0.5ns
Main memory reference: ~100ns (200x slower than L1)
SSD random read: ~150µs
Network round-trip same-DC: ~0.5ms
Network round-trip same-region: ~1–5ms
Network round-trip cross-continent: ~80–200ms

Every microservice boundary you cross is at least 0.5ms in the same DC. Every cross-region call is 100ms. A user-facing request that passes through 8 services, hits a cross-region database, and waits on a cache miss can easily reach 500ms even if every service is healthy. The architecture determines the latency floor; you cannot tune your way out of bad topology.

Availability and the “Nines”

Availability is typically expressed as a percentage of uptime over a window. The famous “nines” ladder:

99% (two nines) ⇒ 3.65 days of downtime per year
99.9% (three nines) ⇒ 8.76 hours per year
99.99% (four nines) ⇒ ~52 minutes per year
99.999% (five nines) ⇒ ~5.26 minutes per year

Two practical realities. First, claimed availability rarely matches measured availability - cloud providers exclude maintenance windows, regional issues, and certain failure modes. Second, the dependency math is brutal: a service that depends on five 99.9% services has availability of 0.999^5 = 99.5%. Independent dependencies multiply, and your effective SLO is bounded by your weakest critical path.

Fault Tolerance - Designing for “When”, Not “If”

Fault tolerance is the property that the system continues to operate (perhaps at reduced capacity) even when some components fail. The standard tools:

Redundancy - multiple replicas behind a load balancer; if one dies, others take over.
Timeouts - do not wait forever for a dead dependency. Always set a timeout and have a fallback.
Retries with exponential backoff and jitter - a failed call is retried, but with increasing delay (and randomness) so you do not hammer a recovering service.
Circuit breakers - after N consecutive failures, stop calling the failing dependency for a window so it can recover.
Bulkheads - isolate workloads so noisy neighbours cannot starve critical paths.
Graceful degradation - when a non-critical dependency fails, return reduced functionality rather than full error.

Modules 6 and 7 cover these patterns in depth. For now, the key idea: plan for partial failure as a normal mode of operation, not an emergency.

Consistency Models - What You Promise

From strongest to weakest:

Linearizable - every operation appears to happen at a single instant; reads see the latest write globally. Spanner, etcd Raft.
Sequential consistency - all clients see the same order of operations, not necessarily wall-clock order. Single-leader DBs.
Causal consistency - causally related operations are seen in causal order; concurrent operations may reorder. Common in collaborative editing.
Read-your-writes - a client sees its own writes; other clients may lag.
Eventual consistency - replicas converge if writes stop. The weakest useful guarantee. DynamoDB default.

The right consistency depends on the operation: a balance check needs strong consistency; a “number of likes” can tolerate eventual. Many systems offer tunable consistency at query time (Cassandra CL=QUORUM vs CL=ONE; MongoDB readConcern; DynamoDB ConsistentRead).

How This Course Is Structured

The next 11 modules walk you through every layer of a real distributed system:

Modules 2–3: how services talk (networking, gRPC, events, Kafka).
Modules 4–5: how data is split, replicated, and agreed upon (replication, sharding, consensus).
Modules 6–7: how systems handle scale and failure (autoscaling, circuit breakers, chaos).
Module 8: the security primitives that hold modern distributed systems together (Zero Trust, mTLS, SPIFFE).
Module 9: how you observe and debug distributed systems (tracing, metrics, logs).
Module 10: how Kubernetes changes everything.
Module 11: real failure scenarios you will see in production.
Module 12: how to design end-to-end production systems.

For deeper foundational reading, the Distributed Systems Algorithms guide goes into Raft/Paxos, quorum math, vector clocks, and CRDTs at the algorithm level. For hands-on practice, the Kubernetes Security Simulator exercises the operational decisions that matter from Module 8 onwards. Reach for the Kubernetes cheatsheet when you need a fast operational reference.

The CAP Triangle Visualised

Latency Propagation Across Service Boundaries

Self-Check Quiz

You operate a payment system with 5 microservices on the critical path, each at 99.95% availability. What is your effective availability? (Answer: 0.9995^5 ≈ 99.75%, or ~22 hours of downtime per year. The math always points at one or two services as your investment.)
During a network partition, your CP database refuses writes. The product team asks if you can "just keep accepting writes and reconcile later." How do you frame the trade-off? (Answer: that is exactly the AP choice; it requires designed conflict resolution - LWW, CRDTs, or human reconciliation. CP and AP are not interchangeable mid-flight.)
A user complains that the recommendation panel sometimes shows stale recommendations after they update preferences. What are your three options? (Answer: linearizable reads from the leader; read-your-writes via session affinity; cache invalidation on update with a short TTL fallback.)
Why is “just add more servers” the wrong first move when a service is slow? (Answer: scaling pushes the bottleneck downstream; if the database connection pool is exhausted, more replicas just exhaust it faster. Identify the bottleneck before scaling.)

Real world

Where this shows up

Stripe runs critical payment paths as a monolith with microservice satellites - chose simplicity for the money path, distribution for the periphery.
Shopify operates a Rails “majestic monolith” for the storefront with carved-out services for checkout and search; the architecture is a deliberate trade-off, not the result of an accident.
Amazon's famous “two-pizza team” rule was as much about deployment isolation (each team owns its services end-to-end) as about scaling.
Segment publicly migrated FROM microservices BACK to a monolith for parts of their stack when the operational complexity outweighed the benefits.

Production notes

Keep these close

Track a critical-path availability dashboard (multiply each dependency's SLO) so the org sees the math, not the wishful thinking.
Every cross-service call gets a timeout. Default: <em>do not let your services have unbounded patience</em>. Specific timeouts are part of every service contract.
When you run the availability math, the answer always points at one or two services. That is your investment list, not a hypothetical.

Common mistakes

What usually breaks

Adopting microservices because “everyone else does” before measuring whether failure isolation, independent scaling, or team autonomy actually justify the operational tax.
Treating CAP as a textbook quiz question rather than a runtime decision - the question is “during a real partition, what should this service do?”
Assuming dependencies have advertised availability when measured availability is materially different.

Security risks

Threats to watch

Every distributed boundary is a new attack surface. Service-to-service calls without authn become trust-by-network-position, the model that Zero Trust replaces.
Distributed deployments multiply credential management cost. Long-lived shared secrets in env vars across services become impossible to rotate.
Failure-isolation only works if security is also isolated. A compromised auth service in a 5-service architecture must not give the attacker the keys to all five.

Tradeoffs

Design choices you should be able to defend

Distributed (microservices)

Pros

Failure isolation
Independent deploys
Independent scaling per service
Team autonomy at scale

Cons

Network latency on every boundary
Operational complexity (observability, deployment, security)
Distributed-systems failure modes (split brain, partial failure)
Higher cloud cost

Monolith

Pros

In-process calls (~ns latency)
Simple operational model
Atomic transactions across the entire app
Lower compute cost

Cons

Single failure domain
Lockstep deploys
Vertical scaling only
Coordination tax across teams

Modular monolith

Pros

Most monolith benefits + clean module boundaries
Refactor-able into microservices later
Lowest operational cost for early-stage products

Cons

Module boundaries enforced by discipline, not infrastructure
Still a single deploy unit

Alternatives

Other production approaches

Modular monolith

Single deploy unit with strict module boundaries. Best for early-stage products before team scale forces distribution.

Microservices on Kubernetes

Industry default once 50+ engineers; pays the operational tax for team-autonomy benefits.

Service-Oriented Architecture (SOA)

Older sibling of microservices; coarser-grained services. Still relevant when integrating heterogeneous legacy systems.

Serverless / FaaS

Distribution without operating servers; trade-offs around cold-start latency, vendor lock-in, and observability.

Think like an engineer

Questions to answer before shipping

Before adopting microservices, calculate the latency floor your topology imposes. If your SLO is p99 < 200ms and your chain has 8 hops × 5ms each, you are out of budget before any work happens.
Run the availability math on your critical path quarterly. If your effective availability is 99.5% and you committed 99.9%, the math points at one or two services as the investment list.
Treat every cross-service call as a contract: timeout, retry policy, error semantics, and observability are all part of the contract. Without them, the network has won.

Key terms

Vocabulary used in this module

CAP Theorem

In a distributed system, you can have at most two of Consistency, Availability, and Partition tolerance simultaneously.

PACELC

Extension of CAP: even without partition, trade off Latency vs Consistency.

Availability

Percentage of time the system serves successful responses; often expressed as nines (99.9%, 99.99%, ...).

Fault tolerance

The system continues to operate in some form when components fail.

Linearizability

The strongest consistency model: every operation appears to happen at a single instant.

Labs

Hands-on labs

45 minutesBeginner

Lab 1.1 - Latency Simulation Across Service Boundaries

Measure how cross-service network hops accumulate latency in a real microservice topology.

Spin up 5 small services (Go or Python) on docker-compose
Wire them in a chain: A → B → C → D → E
Add 5ms artificial latency per hop
Send 1000 requests through the chain and record p50/p95/p99
Compare to a single-monolith implementation
Plot the cumulative latency

View lab on GitHub

45 minutesBeginner

Lab 1.2 - Failure Isolation Test

Demonstrate failure isolation: kill one microservice and observe how the system degrades vs how a monolith fails.

Use the same 5-service chain from Lab 1.1
Add graceful degradation in service A: if D fails, return cached or partial response
Send traffic, kill service D mid-flight
Observe error rate, response codes, response shape
Repeat with the monolithic implementation

View lab on GitHub

30 minutesBeginner

Lab 1.3 - Availability Math

Calculate end-to-end availability for a real architecture and identify the weakest link.

Document a real microservice architecture you operate (or a fictional one with 6 services)
Assign each service its measured or claimed availability
Compute end-to-end availability for the critical path
Identify the single change that would most improve overall SLO
Document the find-and-fix recommendation

View lab on GitHub

Recap

Key takeaways

A distributed system exists for failure isolation, independent scaling, and team autonomy - not just “scale”
CAP/PACELC frame the trade-offs you must make consciously; pretending otherwise leads to surprise outages
Latency is set by topology before it is set by code - bad architecture cannot be tuned
Effective availability is the product of dependency availabilities - mind your critical path
Plan for partial failure as a normal operating mode, not an emergency

Related resources

Foundations of Distributed Systems

Learning objectives

Why Distribute? The Real Reasons

The CAP Theorem - A Decision Tool, Not a Theorem

Latency - The Tax You Pay

Availability and the “Nines”

Fault Tolerance - Designing for “When”, Not “If”

Consistency Models - What You Promise

How This Course Is Structured

The CAP Triangle Visualised

Latency Propagation Across Service Boundaries

Self-Check Quiz

Where this shows up

Keep these close

What usually breaks

Threats to watch

Design choices you should be able to defend

Distributed (microservices)

Monolith

Modular monolith

Other production approaches

Modular monolith

Microservices on Kubernetes

Service-Oriented Architecture (SOA)

Serverless / FaaS

Questions to answer before shipping

Vocabulary used in this module

CAP Theorem

PACELC

Availability

Fault tolerance

Linearizability

Hands-on labs

Lab 1.1 - Latency Simulation Across Service Boundaries

Lab 1.2 - Failure Isolation Test

Lab 1.3 - Availability Math

Key takeaways

Keep learning across CodersSecret

Related guides

Cheatsheets

Interactive labs

Glossary terms