Module 5: Consensus & Coordination
How distributed nodes agree — Raft, Paxos, leader election, distributed locking, and the etcd / ZooKeeper / Consul systems that production runs on.
5 hours. 3 hands-on labs. Free course module.
Learning Objectives
- Explain consensus as a problem and why it is fundamental to CP systems
- Walk through Raft leader election, log replication, and safety in detail
- Compare Raft and Paxos and pick between them in practice
- Implement distributed locking correctly (with fencing tokens, not just SETNX)
- Operate etcd, ZooKeeper, or Consul without taking down your cluster
Why This Matters
Engineers who understand consensus stop being scared of etcd. They can read a Raft log replay, recover a stuck cluster, and design a system around the trade-offs of CP versus AP rather than tripping over them. This is the module that separates engineers who treat distributed coordination as “magic” from engineers who treat it as load-bearing infrastructure they own.
Lesson Content
Consensus is the hardest problem in distributed systems. It is also the most under-appreciated, because by the time you are using it — via Kubernetes etcd, Consul, HashiCorp Vault HA storage, CockroachDB ranges — someone else has implemented it correctly and you mostly do not notice it's there. Until you do.
What Consensus Is
The problem: a group of nodes must agree on a single value, even when some of them fail or messages are dropped. If they agree on a value, all surviving nodes must agree on the same value. The agreement must be safe under any failure pattern that does not partition more than half the nodes (the FLP impossibility result, 1985).
Consensus is the basis of every CP system. To have a single leader, you need consensus on who the leader is. To have replicated state, you need consensus on what the next state should be. To do distributed locking correctly, you need consensus on who holds the lock.
Raft — Consensus for Humans
Raft (Ongaro & Ousterhout, 2014) was designed to be understandable. It decomposes consensus into three subproblems — leader election, log replication, and safety — and provides the same correctness guarantees as Paxos with a much simpler mental model.
Leader election: every node is in one of three states (follower, candidate, leader). Followers expect heartbeats from the leader; if they do not arrive within a randomised election timeout (150–300ms), the follower transitions to candidate, increments the term number, and requests votes from peers. A candidate that gathers a majority becomes leader for that term. Term numbers are monotonic; older terms are rejected.
Log replication: clients send commands to the leader. The leader appends to its log, sends AppendEntries to followers, and commits the entry once a majority have acknowledged. Followers apply committed entries to their state machines in order.
Safety: election rules ensure a candidate can only win if its log is at least as up-to-date as a majority of followers. This guarantees committed entries are never lost.
Paxos and Why You Probably Don't Implement It
Paxos (Lamport, 1989) was the first practical consensus algorithm and is still mathematically influential. In production it has been largely displaced by Raft for new systems because Raft is meaningfully easier to implement correctly. Google's Chubby and Spanner use Paxos variants; etcd, Consul, CockroachDB, HashiCorp Vault, and most modern distributed systems use Raft.
Cluster Sizing
The fundamental rule: a Raft (or Paxos) cluster of 2N+1 nodes tolerates N simultaneous failures. So:
- 3 nodes ⇒ tolerates 1 failure.
- 5 nodes ⇒ tolerates 2 failures.
- 7 nodes ⇒ tolerates 3 failures (rarely used; commit latency suffers).
Always use odd numbers. A 4-node cluster requires 3 to commit; same fault tolerance as 3 nodes, more network traffic. A 6-node cluster requires 4 to commit; same fault tolerance as 5.
Distributed Locking
The naive Redis lock (SET lock value NX EX 30) famously fails under network partition: a client believes it holds the lock; the network blips; the lock TTL expires; another client acquires the lock; the original client wakes up and acts as if it still holds the lock; you have two writers.
The robust pattern: fencing tokens. The lock service issues a monotonically increasing token with each acquisition. The lock holder includes the token in every operation; the resource (database, file system) rejects operations from older tokens. Even if two clients believe they hold the lock, only the higher-token operation succeeds. etcd, ZooKeeper, and Consul all support this pattern; Redis does not natively.
etcd, ZooKeeper, Consul — The Production Trio
- etcd: Raft-based KV store; the substrate of Kubernetes; used by Vault HA, Rook. Optimised for correctness and Kubernetes integration.
- ZooKeeper: ZAB-based (similar to Raft) coordination service; older, mature, used by HBase, Kafka (legacy), Solr.
- Consul: Raft-based service registry + KV + health checks + service mesh; HashiCorp's integrated approach.
For new infrastructure, etcd is the default. For ZooKeeper-backed legacy systems (HBase, older Kafka), running ZooKeeper is the path of least resistance. Consul shines when you want the service-registry features alongside the coordination primitives.
Operational Hazards
- Stuck quorum: 3-node etcd loses 2 nodes; remaining node cannot make progress. Recovery: single-node restoration from snapshot, then re-add members. Practice this drill.
- Slow disk: Raft commit requires fsync; a slow disk slows every write across the cluster. Use SSDs; alert on fsync latency.
- Network partition: minority side cannot elect a leader and refuses writes; majority side keeps working. Verify clients fail-fast on the minority side rather than hanging.
- Cluster too large: more than 7 voters and commit latency suffers; consider learner nodes for scale-out.
Distributed Lock with Fencing Token
Self-Check Quiz
- You have a 4-node Raft cluster. How many failures can it tolerate? (Answer: still only 1. 2N+1 with N=1 (so 3 nodes) and N=2 (so 5 nodes); 4 nodes need 3 to commit, same as 3-node cluster but with more network traffic. Always odd.)
- Why does the naive Redis SETNX lock fail under network partition? (Answer: TTL expires while client thinks it holds the lock; another client acquires; original client wakes up and acts as if it still has the lock. Need fencing tokens.)
- etcd commit latency suddenly tripled. What is the most likely cause? (Answer: slow disk fsync. Raft commit requires fsync on the leader and majority followers. SSDs and disk-latency monitoring matter.)
- Your 3-node etcd cluster lost 2 nodes. What is the recovery path? (Answer: do NOT add nodes to a stuck cluster. Snapshot from the surviving node, single-node restoration, then add members one at a time. Test this drill quarterly.)
For deeper coverage of how SPIRE Server uses Raft for HA, see the SPIRE Architecture & Components module. The SPIFFE/SPIRE cheatsheet is the fast operational reference once you start running etcd-backed identity systems. The underlying identity primitive (SPIFFE) is the standard the rest of the modern stack converges on.
Real-World Use Cases
- etcd backs every Kubernetes cluster ever deployed; Raft is in your production stack whether you knew it or not.
- CockroachDB runs thousands of Raft groups per cluster, one per data range.
- HashiCorp Vault HA storage uses Raft for replicated secret state.
- Consul uses Raft for the catalogue and KV store; Serf for the gossip layer.
Production Notes
- Always use odd-number Raft clusters: 3 for most workloads, 5 for high availability. Never even.
- Practice etcd snapshot recovery quarterly. The runbook for recovering a stuck quorum is the difference between minutes and hours of cluster downtime.
- Run etcd on dedicated SSDs with low fsync latency. Slow disks slow every write across the cluster.
Common Mistakes
- Inventing your own “HA” with naive locks (Redis SETNX) instead of using consensus primitives. Always fails under partition.
- Adding a node to a stuck Raft cluster instead of restoring from snapshot. New nodes need a quorum to join.
- Running consensus on shared infrastructure (etcd on the same disk as your database). Slow neighbour = stuck quorum.
Security Risks to Watch
- etcd unencrypted at rest exposes every Kubernetes secret in the snapshot. Always enable KMS-backed encryption.
- etcd peer / client TLS is not on by default in some installers; verify before going to production.
- A compromised etcd member can poison cluster state. Audit etcd member additions and rotate certs regularly.
- Distributed locks without fencing tokens silently allow stale lock holders to corrupt resources.
Design Tradeoffs
Raft
Pros
- Designed for understandability
- Stable, well-implemented in many libraries
- Standard for new systems
Cons
- Newer than Paxos
- Performance very slightly behind multi-Paxos in some workloads
Paxos / Multi-Paxos
Pros
- Mathematically influential
- Battle-tested in Google Spanner / Chubby
Cons
- Notoriously hard to implement correctly
- Many subtle production bugs
3-node Raft cluster
Pros
- Tolerates 1 failure
- Lowest commit latency
- Cheapest infra
Cons
- Loss of 2 nodes = stuck quorum
5-node Raft cluster
Pros
- Tolerates 2 failures
- Survives multi-AZ outages
Cons
- Higher commit latency
- More infra cost
Production Alternatives
- etcd: Raft-based KV; substrate of Kubernetes; default for new infra.
- ZooKeeper: ZAB-based; mature, used by HBase / older Kafka / Solr.
- Consul: Raft + service registry + KV + health checks; HashiCorp's integrated approach.
- Apache Bookkeeper: Used as the backing store for Pulsar; consensus + ledger storage.
- Hazelcast / Apache Ignite: In-memory data grids with consensus-backed coordination; fits when latency < 1ms required.
Think Like an Engineer
- Before you reach for a distributed lock, ask: do I need lock semantics, or do I need leader-elected singleton execution? They are different problems with different primitives.
- When you see “HA via two replicas” in a design doc, ask: what happens during a partition? If the answer is fuzzy, you do not have HA — you have two SPOFs.
Production Story
A 3-node etcd cluster on AWS lost two nodes during an AZ event. The remaining node could not reach quorum and refused all writes. The Kubernetes API froze. The on-call team panicked and tried to add new nodes, which made things worse — new members cannot join a cluster without quorum. Recovery required restoring from snapshot to a single node, then re-adding members one at a time. Total outage: 2 hours. Lesson: 5-node etcd across 3 AZs from the start, plus a tested runbook the team has actually executed in a drill.
Key Terms
- Consensus
- The problem of getting a group of distributed nodes to agree on a single value, even with failures.
- Raft
- A consensus algorithm designed to be understandable; used by etcd, Consul, CockroachDB.
- Quorum
- Majority of nodes; required for any progress in Raft / Paxos.
- Fencing token
- Monotonic token issued by a lock service; prevents stale lock holders from corrupting state.
- Leader election
- Process by which a single node is chosen to coordinate; foundational to Raft and many distributed systems.
Hands-On Labs
-
Lab 5.1 — etcd Cluster Bootstrap and Failover
Run a 3-node etcd cluster, write data, kill one node, verify availability; kill two, observe stuck quorum.
90 minutes - Intermediate
- Bootstrap 3-node etcd via docker-compose
- Write keys, observe replication
- Kill one node; verify writes still succeed
- Kill two nodes; verify writes block
- Restore from snapshot; re-add members
-
Lab 5.2 — Distributed Lock with Fencing Token
Implement a correct distributed lock using etcd Lease + fencing token; demonstrate why naive locks fail.
90 minutes - Advanced
- Implement naive Redis SETNX lock; reproduce partition failure mode
- Implement etcd-based lock with fencing token
- Resource (Postgres) rejects operations from old tokens
- Demonstrate two-client race; only one operation succeeds
-
Lab 5.3 — Leader Election in Application Code
Build a singleton-task pattern: many replicas, only one runs the periodic job at a time.
60 minutes - Intermediate
- Use Kubernetes Lease object as election primitive
- Run 3 replicas; only the leader executes the cron logic
- Kill the leader; verify another replica takes over within seconds
Key Takeaways
- Consensus is the substrate of every CP system; Kubernetes, Vault, and most modern infra rely on Raft etcd
- Use Raft for new systems; Paxos is the older, harder alternative
- Cluster size: 2N+1 tolerates N failures; always odd; 3 or 5 for nearly all real workloads
- Distributed locking requires consensus + fencing tokens, not naive SETNX
- A stuck quorum is recoverable but only if you have a tested runbook; never improvise