Distributed Systems Algorithms

Distributed systems algorithms are the load-bearing wall of modern infrastructure. Every Kubernetes cluster, every Kafka pipeline, every Redis Cluster, every Cassandra ring is built on a small set of foundational ideas - consensus, replication, partitioning, gossip, quorum reads - combined in different proportions. Understand those ideas in their pure form, and the "magic" of every distributed system you will ever operate becomes legible.

This guide is the practitioner's walk through those algorithms: what each one solves, where it shows up in real production systems, and the trade-offs that determine whether your system can be available, consistent, and fast at the same time. By the end, you should be able to read an etcd outage post-mortem, a Cassandra hinted-handoff log, or a Kafka leader-election trace and understand exactly what is happening underneath.

The CAP Theorem - What Your System Can Promise

Eric Brewer's CAP theorem (formalised by Gilbert and Lynch in 2002) states that a distributed system can guarantee at most two of three properties at any moment: Consistency (every read sees the latest write), Availability (every request gets a response), and Partition tolerance (the system continues operating even when the network drops messages between nodes).

Network partitions are a real-world inevitability - cables get cut, NICs fail, packet loss spikes during deploys. So you do not get to opt out of P. The real choice in CAP is between C and A during a partition:

CP systems (etcd, ZooKeeper, HBase, Spanner) sacrifice availability during a partition. If a quorum of nodes cannot agree, the system refuses writes (and sometimes reads) rather than serve stale data.
AP systems (Cassandra, DynamoDB, Riak, CouchDB) sacrifice strong consistency during a partition. Every node keeps accepting reads and writes, and the system reconciles divergence later via conflict-resolution rules.

The PACELC extension (Daniel Abadi, 2010) makes the picture more honest: even when there is no Partition, you trade off between Latency and Consistency. Spanner is CP/EC (strict consistency at the cost of cross-region latency); DynamoDB is AP/EL (low latency at the cost of eventual consistency by default).

CAP THEOREM - PICK TWO DURING A PARTITION Consistency Availability Partition Tolerance CP etcd, ZooKeeper Spanner, HBase AP Cassandra, DynamoDB Riak, CouchDB CA (only when no partition) PACELC adds: even with NO partition, trade off Latency ↔ Consistency

The practical lesson: when you adopt a distributed datastore, do not ask "is it consistent or available?" - ask "during a network partition, would I rather take an outage or serve stale data?" The honest answer depends entirely on the workload. Payments cannot serve stale balances; product catalogue reads can.

Consensus - The Hardest Problem in Distributed Systems

Consensus is the problem of getting a group of nodes to agree on a single value, even when some of them fail or messages are lost. It is the foundation of every CP system - you need consensus to elect a leader, replicate a write, or coordinate a configuration change.

The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proved that no completely asynchronous consensus protocol can guarantee both safety and liveness in the presence of a single failed node. Real systems work around this with timeouts and randomised back-off, accepting that consensus may stall briefly but never produces wrong answers.

Paxos - Brilliant but Hard to Implement

Leslie Lamport's Paxos (1989, published 1998) was the first practical consensus algorithm. It guarantees safety (all nodes agree on the same value) under any failure pattern that does not partition more than half the nodes. It works in three phases - prepare, promise, accept - with a designated proposer, a set of acceptors, and learners who eventually receive the chosen value.

Paxos is correct but notoriously hard to implement. Lamport himself wrote a paper called "Paxos Made Simple" that is still considered difficult. Google's Chubby and the original Spanner implementations are Paxos-based; Google engineers have written about how much production code it takes to handle the edge cases.

Raft - Paxos for Humans

Raft (Ongaro and Ousterhout, 2014) was designed explicitly to be understandable. It guarantees the same safety properties as Paxos but decomposes the algorithm into three subproblems: leader election, log replication, and safety. Every node is in one of three states: follower, candidate, or leader.

Leader election: if a follower hears nothing from the leader for a randomised election timeout (typically 150–300ms), it transitions to candidate, increments its term number, and requests votes. A candidate that gathers a majority vote becomes the leader for that term.
Log replication: the leader appends client commands to its log, replicates them to followers, and commits each entry once a majority have acknowledged it. Followers apply committed entries to their state machines in order.
Safety: election restrictions ensure a candidate can only win if its log is at least as up-to-date as a majority of followers - so committed entries never get lost or overwritten.

RAFT LEADER ELECTION FOLLOWER Receives heartbeats from current leader Election timeout: 150–300ms random No heartbeat? ↓ become candidate CANDIDATE Increment term Vote for self RequestVote RPCs to all peers Majority votes? ↓ become leader Split vote → retry with new timeout LEADER Send AppendEntries heartbeats every ~50ms Accept client writes, replicate log to majority Commit on majority ack, apply to state machine Higher term seen → step down to follower

Raft in Production: etcd, Consul, CockroachDB

Raft has become the default consensus algorithm for new distributed systems. etcd (used by Kubernetes for cluster state) runs Raft groups of typically 3 or 5 nodes. Consul uses Raft for its catalogue and KV store. CockroachDB uses one Raft group per data range, with thousands of Raft groups per cluster. The HashiCorp Vault integrated storage backend is Raft-backed.

The operational implications matter. A 3-node Raft cluster tolerates 1 failure. A 5-node cluster tolerates 2. Adding more nodes does not help availability past a point - in fact, more nodes mean more network traffic for every commit and a larger quorum to reach. The classic recommendation is 3 nodes for most clusters, 5 for clusters that need to tolerate multiple simultaneous failures, and never even numbers (which give you no extra fault tolerance and a higher chance of split votes).

For a deeper walk through how SPIRE servers use distributed consensus to issue verifiable workload identities at scale, see the SPIRE Architecture & Components module in the free Mastering SPIFFE & SPIRE course.

Leader Election Beyond Consensus

Many systems need a single leader without running full consensus. Kafka elects partition leaders via the controller broker, which is itself elected via ZooKeeper (or, in KRaft mode, via an internal Raft group). Redis Sentinel elects a single Sentinel as the failover orchestrator using a quorum-based vote. Even a Kubernetes controller manager runs leader election - multiple replicas race to acquire a Lease object in the API server, and only the holder reconciles state.

The key property all leader-election protocols need is at most one leader per term. The Lamport-clock-style monotonic term number is what prevents split-brain - a deposed leader cannot keep accepting writes because every follower will reject any RPC tagged with an old term.

Distributed Locking

Closely related to leader election is distributed locking. A correct distributed lock requires consensus underneath - a node holding the lock must be the only node convinced it holds the lock. Naive Redis "SETNX with TTL" locks famously fail under network partitions (Martin Kleppmann's "How to Do Distributed Locking" is required reading). The Redlock algorithm tries to harden this with multiple Redis instances, but Kleppmann's critique demonstrates failure modes that still allow two nodes to believe they hold the lock.

The robust pattern is to use a CP system (etcd, ZooKeeper, Consul) and treat the lock as a leadership lease with a fencing token - a monotonic counter that increases every time the lock is acquired. The lock holder includes the fencing token in every operation; the resource (database, file system) rejects operations from older fencing tokens. This is what protects against the "old leader thinks it still holds the lock" scenario.

Replication Strategies

Replication is the process of keeping multiple copies of data in sync across nodes. The choice of replication strategy determines durability, latency, and consistency.

Synchronous vs Asynchronous Replication

Synchronous replication: writes are not acknowledged until at least one replica has confirmed. Higher latency, no data loss on primary failure. PostgreSQL streaming replication with synchronous_commit = on; Cassandra with QUORUM writes.
Asynchronous replication: writes are acknowledged as soon as the primary commits locally; replicas catch up later. Lower latency, but a window of potential data loss if the primary crashes before replicating. MySQL default; MongoDB with w: 1.
Semi-synchronous: a hybrid where writes are acknowledged after at least one (any one) replica confirms, regardless of which one. MySQL semi-sync, PostgreSQL synchronous_standby_names with FIRST or ANY.

Single-Leader, Multi-Leader, Leaderless

Single-leader replication (PostgreSQL, MySQL, MongoDB) routes all writes through a designated leader; followers replicate the leader's log. Simple to reason about, but the leader is a bottleneck and a single point of failure (mitigated by automatic failover).

Multi-leader replication (multi-region MySQL with bidirectional replication, CRDT-backed systems) accepts writes at any node. Conflict resolution becomes the central problem - two nodes can accept conflicting writes that need to be merged. Common in geo-distributed systems where local writes matter more than global consistency.

Leaderless replication (Cassandra, DynamoDB, Riak) sends every write to multiple replicas in parallel. There is no single "leader"; reads use a quorum to reconcile divergent replicas. This is the foundation of the Dynamo-style architecture.

REPLICATION MODELS SINGLE-LEADER MULTI-LEADER LEADERLESS LEADER R1 R2 R3 All writes through leader. PostgreSQL, MySQL, Mongo. L1 L2 R R Writes accepted by either. Conflict resolution required. N1 N2 N3 N4 N5 Writes to N replicas; reads via quorum. Cassandra, Dynamo.

Quorums - The Math of Distributed Reads and Writes

Leaderless systems use quorum reads and writes to balance consistency and availability. The math is simple: with N replicas, a write quorum of W, and a read quorum of R, you achieve strong consistency when W + R > N. The intuition: any read quorum overlaps with the most recent write quorum by at least one node, so the latest write is always visible.

N=3, W=2, R=2: tolerates 1 node failure for writes and reads, strong consistency, common Cassandra setup.
N=3, W=3, R=1: every write hits all replicas; reads are fast and consistent, but a single node failure blocks writes. Used for read-heavy, low-write workloads.
N=3, W=1, R=1: maximum availability and lowest latency, eventual consistency only. DynamoDB default.
N=5, W=3, R=3: tolerates 2 simultaneous failures, strong consistency. Used by mission-critical Cassandra clusters.

Cassandra exposes these as consistency levels: ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL. The LOCAL_QUORUM variant is critical for multi-region deployments - it requires a quorum within the local datacentre but does not wait for cross-region acks, dramatically reducing latency at the cost of cross-region consistency.

QUORUM CONSISTENCY: W + R > N ⇒ STRONG CONSISTENCY N=5, W=3, R=3, W+R=6 > 5 ✓ N1 N2 N3 N4 N5 W W W+R R R Write set {N1,N2,N3} overlaps Read set {N3,N4,N5} on N3 ⇒ latest write is always visible. N=5, W=2, R=2, W+R=4 < 5 ✗ Eventual consistency only. Read can miss latest write. Lower latency & higher availability. DynamoDB default behavior.

Partitioning - Spreading Data Across Nodes

Partitioning (also called sharding) is the strategy for distributing data across nodes so that each node owns a slice of the keyspace. Two design questions dominate: how do you choose the partition for a key, and how do you rebalance when nodes join or leave.

Hash-Based Partitioning vs Range Partitioning

Hash partitioning applies a hash function to the key and modulo-divides by the number of partitions. Excellent load distribution, but range queries are expensive (they must touch every partition). Used by Cassandra, DynamoDB, Redis Cluster.

Range partitioning assigns each node a contiguous range of keys. Range queries are efficient, but creating hot spots is easy (a key prefix that is heavily written becomes a single-node bottleneck). Used by HBase, Bigtable, CockroachDB, MongoDB sharded clusters.

Consistent Hashing

Naive hash partitioning has a serious flaw: when you add or remove a node, almost every key needs to move to a different partition (because hash(key) % N changes for almost every key when N changes). For a multi-terabyte cluster, that is an unsurvivable rebalance.

Consistent hashing (Karger et al., 1997) solves this by hashing both keys and nodes onto a single ring. Each key belongs to the next node clockwise on the ring. When a node is added, only the keys between it and the previous node on the ring need to move - typically 1/N of the keyset. When a node is removed, its keys move to the next clockwise node. Rebalancing is O(K/N) instead of O(K).

Real systems combine consistent hashing with virtual nodes (vnodes): each physical node is hashed onto the ring at hundreds or thousands of positions. This smooths out the inevitable hash-distribution unevenness and makes rebalancing more granular. Cassandra defaults to 256 vnodes per physical node; Dynamo and Riak use similar patterns.

CONSISTENT HASHING RING (with virtual nodes) A A A B B B C C C D D D key X hash(X) lands here → key X belongs to next clockwise node = B Add a node: only keys in its arc need to move.

Production Examples of Partitioning

Kafka: each topic is split into N partitions; each partition is assigned to one broker as its leader. The producer chooses the partition by hashing the message key (or round-robin if no key). Adding brokers requires explicit partition reassignment via kafka-reassign-partitions - Kafka does not automatically rebalance.
Redis Cluster: 16384 hash slots mapped to nodes. Adding a node moves a configurable subset of slots and their data via the CLUSTER ADDSLOTS + MIGRATE commands. Clients are aware of the slot mapping (via CLUSTER NODES) and route requests directly.
Cassandra: vnode-based consistent hashing. Token assignment is automatic; bootstrap of a new node automatically streams the right data ranges from existing nodes.
DynamoDB: opaque to the user. AWS handles partitioning, splitting, and rebalancing. Hot-partition issues are surfaced as throttling errors.

Gossip Protocols - Membership and Failure Detection at Scale

Gossip protocols (also called epidemic protocols) are how large clusters propagate state changes without centralised coordination. Each node periodically picks a few random peers and exchanges state - cluster membership, node liveness, schema versions, partition ownership. Information spreads exponentially: in O(log N) rounds, every node has the latest state.

Cassandra uses gossip for cluster membership and failure detection. Consul agents gossip via the Serf library (HashiCorp's gossip implementation, based on the SWIM protocol). Redis Cluster gossips slot ownership and node health. The HashiCorp Memberlist library powers gossip in many infrastructure tools.

The SWIM protocol (Scalable Weakly-consistent Infection-style Membership, Das et al. 2002) is the modern standard. SWIM has two interleaved components: a failure detector that pings random peers (with indirect-ping fallback to distinguish a real failure from network jitter) and a gossip channel that disseminates membership changes piggybacked on those pings. SWIM avoids the all-to-all heartbeat traffic of older systems while still detecting failures within seconds.

Vector Clocks and CRDTs - Reasoning About Concurrent Updates

In a leaderless or multi-leader system, two nodes can independently accept conflicting writes. Vector clocks let you detect concurrency: each node maintains a counter, every write is tagged with the full vector of counters at write time, and you can compare two writes to determine whether one happened-before the other or whether they are concurrent (and thus need conflict resolution).

Vector clocks tell you that there is a conflict; CRDTs (Conflict-free Replicated Data Types) tell you how to resolve it deterministically. CRDTs are data structures with merge operations that are commutative, associative, and idempotent - meaning replicas can apply updates in any order and still converge to the same state.

G-Counter: a grow-only counter where each replica owns its own slot; merge is element-wise max. Used for view counts, like counts.
PN-Counter: two G-Counters (positive and negative); supports increment and decrement.
OR-Set (Observed-Remove Set): tracks adds and removes with unique tags so a remove only affects observed adds, avoiding the "remove a not-yet-replicated add" anomaly.
LWW-Element-Set (Last-Write-Wins): each element has a timestamp; the highest timestamp wins. Simple but loses concurrent updates.

Production CRDT users include Riak (which exposes counter, set, and map CRDTs as native data types), Redis (the HyperLogLog structure is a probabilistic CRDT for cardinality estimation), and the Yjs / Automerge libraries that power collaborative editing in tools like Figma and Google Docs.

Consistency Models - What Your System Promises

The terminology around consistency is confusing because the database community and the distributed-systems community use overlapping but different vocabularies. The hierarchy from strongest to weakest:

Strict (Linearizability): every operation appears to happen at a single instant between its invocation and response. Spanner's TrueTime achieves this externally; etcd Raft achieves this for its log.
Sequential consistency: all clients see operations in the same order, but that order does not need to match wall-clock time. Single-leader systems with replication lag.
Causal consistency: writes that are causally related (e.g. a reply to a comment) are seen in causal order; concurrent writes can be reordered. The default in many session-aware systems.
Read-your-writes: a client always sees its own previous writes, even if other clients have not. Common in social-app feed semantics.
Eventual consistency: replicas converge to the same value if writes stop. The weakest useful guarantee. DynamoDB default; Cassandra with low consistency levels.

Many real systems offer tunable consistency: Cassandra lets you pick consistency level per query; MongoDB exposes readConcern and writeConcern; DynamoDB lets you flag ConsistentRead on a per-call basis. The right answer depends on the operation: a balance check needs strict consistency; a "number of likes" can tolerate eventual.

CONSISTENCY MODELS - STRENGTH vs LATENCY stronger ↑ cheaper / faster → Linearizable every op atomic at one instant; reads see latest write globally Spanner, etcd Sequential all clients see ops in same order; not necessarily wall-clock order single-leader DBs Causal causally related writes seen in causal order; concurrent ones can reorder session systems Read-your-writes a client always sees its own previous writes; others lag social feeds Eventual replicas converge if writes stop; no in-flight ordering guarantees DynamoDB default Tunable per query in Cassandra (CL=ONE..ALL), MongoDB (readConcern/writeConcern), DynamoDB (ConsistentRead).

Production Failure Scenarios

Split Brain

A network partition isolates two halves of a cluster, each of which elects a leader and accepts writes. When the partition heals, you have divergent state. Consensus algorithms prevent this by requiring a majority - one side of any partition has minority and cannot elect a leader. Systems without consensus (multi-leader, leaderless) require explicit conflict resolution or can simply lose data.

Clock Skew

Many algorithms (LWW timestamps, lease-based locking, distributed transactions) assume monotonic, synchronised clocks. Reality: NTP can drift, leap seconds happen, virtual machines can pause. Spanner solved this with TrueTime - a custom hardware-backed clock with bounded uncertainty. CockroachDB uses HLC (Hybrid Logical Clocks) to combine wall-clock time with logical counters. The lesson: do not trust clocks for correctness.

The Stuck etcd Cluster

The most common etcd outage in production Kubernetes clusters: a 3-node etcd cluster loses 2 nodes (perhaps a node-pool replacement gone wrong, or an AZ failure). The remaining node cannot reach quorum, refuses writes, and the entire Kubernetes API freezes. The recovery is a careful single-node restoration from backup, then re-adding the other members. The lesson: 5-node etcd clusters across 3 AZs for any production workload, and tested backup/restore runbooks.

Observability for Distributed Systems

Distributed systems fail in distributed ways. The minimum observability stack:

Distributed tracing (OpenTelemetry, Jaeger): see how a request flows across services and where latency accumulates.
Quorum-aware metrics: for Raft systems, expose leader-election count, log-commit lag per follower, snapshot frequency, and time-since-last-snapshot.
Replication lag: for any replicated datastore, the most important metric is replica lag in seconds (or bytes). This is your data-loss window if the primary fails right now.
Network partition simulation: regular fault injection (Chaos Mesh, Gremlin, custom iptables scripts) to verify failover behaviour before it happens for real.

For securing the control plane of your distributed systems - the API servers, the etcd clusters, the control-plane to data-plane communication - the free Cloud Native Security Engineering course covers the full picture from RBAC and network policy through to mTLS bootstrap with SPIFFE workload identity. For hands-on practice, try the Kubernetes Security Simulator.

Frequently Asked Questions

Is Raft simpler than Paxos in practice, or just easier to explain?

Both. The original Raft paper's explicit goal was understandability without sacrificing correctness, and it succeeded - the algorithm has a smaller state machine and a cleaner separation of concerns (election vs replication vs safety). In production, Raft implementations tend to have fewer subtle bugs than Paxos implementations of comparable maturity.

When would I run consensus instead of using a managed service like DynamoDB or Spanner?

If you can use a managed service, you almost always should - consensus is operationally hard. You run consensus directly when you need control-plane coordination (a Kubernetes-style controller), when you cannot rely on cloud services (on-prem, edge, regulated industries), or when your latency budget rules out a hosted database.

Are vector clocks still relevant given CRDTs?

Yes. Vector clocks detect concurrent updates; CRDTs resolve them. Many real systems use both: vector clocks (or similar lineage tracking) to identify divergent replicas, then CRDT semantics to merge them deterministically. Riak's data types are a textbook example.

What is the difference between gossip and consensus?

Consensus produces strict agreement on a single value (every node knows the same answer or no answer at all). Gossip produces eventual agreement on a set of facts (every node learns the truth eventually). You use consensus for correctness-critical state (cluster membership in CP systems, leader identity, configuration); you use gossip for soft state where eventual convergence is acceptable (failure detection, schema versions, partition ownership hints).

How does Kubernetes use consensus?

Kubernetes itself does not implement consensus - it relies on etcd to store all cluster state. Every kubectl apply ends as a Raft-replicated commit in etcd. The Kubernetes API server is a stateless proxy in front of etcd; the controller manager and scheduler use leader-election leases (also stored in etcd) to ensure only one replica is active at a time.

What is the "reading from a follower" consistency hazard?

A common pattern is to send writes to a leader and reads to followers (to scale read throughput). The hazard: a follower may not have replicated the most recent writes, so a client that just wrote a value can read its own old value. Solutions: stickify reads to the leader for a short window after each write; use a consistency token (like CockroachDB's follower_read_timestamp) that the follower waits to satisfy; or accept the inconsistency for non-critical reads.

Conclusion

Distributed systems algorithms are not magic. Every behaviour you see in production - etcd refusing writes during a partition, Cassandra returning slightly stale rows under LOCAL_QUORUM, Redis Cluster reshuffling slots when you add a node, Kafka pausing for milliseconds during a controller election - is the direct consequence of a small set of well-defined algorithms making correct trade-offs. Once you can name the algorithm, you can predict the failure mode; once you can predict the failure mode, you can design around it.

The high-leverage takeaways for production engineers: the partition-tolerance choice is the only meaningful one in CAP - pick whether you want availability or consistency under partition, and design around that; 3-node Raft for control-plane state, 5-node for higher-availability; quorum math (W + R > N) gives strong consistency with leaderless replication; consistent hashing with vnodes is the default partitioning scheme for any cluster you expect to grow; vector clocks detect divergence, CRDTs resolve it deterministically; do not trust clocks for correctness. Every distributed datastore you operate is an instance of these primitives composed in some configuration; reading docs through that lens makes them dramatically more comprehensible.

Where to Go Next

Distributed systems is a deep field; this guide is the foundation. Recommended next steps:

Read the Mastering SPIFFE & SPIRE course to see how distributed identity systems use these algorithms in production - SPIRE Server uses Raft for HA, federation uses gossip-style trust-bundle exchange, and SVID issuance is a textbook example of leader-routed writes.
Walk through the Kubernetes Foundations & Security module to understand how Kubernetes' etcd, controller leases, and API watch streams compose all of these algorithms into a real cluster.
Practice with the Kubernetes Security Simulator and the Zero Trust Network Builder to internalise the operational implications.
Read the original Raft paper (In Search of an Understandable Consensus Algorithm) and Martin Kleppmann's Designing Data-Intensive Applications for the depth this guide cannot fit.