Module 5 of 12

Consensus & Coordination

How distributed nodes agree - Raft, Paxos, leader election, distributed locking, and the etcd / ZooKeeper / Consul systems that production runs on.

5 hours3 labsFree

Watch as Slides Course overview Lab code

Start here

Learning objectives

Explain consensus as a problem and why it is fundamental to CP systems
Walk through Raft leader election, log replication, and safety in detail
Compare Raft and Paxos and pick between them in practice
Implement distributed locking correctly (with fencing tokens, not just SETNX)
Operate etcd, ZooKeeper, or Consul without taking down your cluster

Before

Naive Redis SETNX locks; silent corruption under partition
Single etcd node “for simplicity”; one disk failure = entire cluster unrecoverable
No documented runbook for stuck quorum; outages last hours
ZooKeeper running on shared infrastructure; noisy neighbour starves consensus

After

Distributed locks via etcd Lease + fencing token; resource rejects stale tokens
5-node etcd across 3 AZs; tested snapshot restore runbook
Quorum loss recovery practiced quarterly; outages bounded to minutes
Dedicated etcd nodes with SSD; fsync latency monitored as first-class metric

Consensus is the hardest problem in distributed systems. It is also the most under-appreciated, because by the time you are using it - via Kubernetes etcd, Consul, HashiCorp Vault HA storage, CockroachDB ranges - someone else has implemented it correctly and you mostly do not notice it's there. Until you do.

What Consensus Is

The problem: a group of nodes must agree on a single value, even when some of them fail or messages are dropped. If they agree on a value, all surviving nodes must agree on the same value. The agreement must be safe under any failure pattern that does not partition more than half the nodes (the FLP impossibility result, 1985).

Consensus is the basis of every CP system. To have a single leader, you need consensus on who the leader is. To have replicated state, you need consensus on what the next state should be. To do distributed locking correctly, you need consensus on who holds the lock.

Raft - Consensus for Humans

Raft (Ongaro & Ousterhout, 2014) was designed to be understandable. It decomposes consensus into three subproblems - leader election, log replication, and safety - and provides the same correctness guarantees as Paxos with a much simpler mental model.

Leader election: every node is in one of three states (follower, candidate, leader). Followers expect heartbeats from the leader; if they do not arrive within a randomised election timeout (150–300ms), the follower transitions to candidate, increments the term number, and requests votes from peers. A candidate that gathers a majority becomes leader for that term. Term numbers are monotonic; older terms are rejected.

Log replication: clients send commands to the leader. The leader appends to its log, sends AppendEntries to followers, and commits the entry once a majority have acknowledged. Followers apply committed entries to their state machines in order.

Safety: election rules ensure a candidate can only win if its log is at least as up-to-date as a majority of followers. This guarantees committed entries are never lost.

Paxos and Why You Probably Don't Implement It

Paxos (Lamport, 1989) was the first practical consensus algorithm and is still mathematically influential. In production it has been largely displaced by Raft for new systems because Raft is meaningfully easier to implement correctly. Google's Chubby and Spanner use Paxos variants; etcd, Consul, CockroachDB, HashiCorp Vault, and most modern distributed systems use Raft.

Cluster Sizing

The fundamental rule: a Raft (or Paxos) cluster of 2N+1 nodes tolerates N simultaneous failures. So:

3 nodes ⇒ tolerates 1 failure.
5 nodes ⇒ tolerates 2 failures.
7 nodes ⇒ tolerates 3 failures (rarely used; commit latency suffers).

Always use odd numbers. A 4-node cluster requires 3 to commit; same fault tolerance as 3 nodes, more network traffic. A 6-node cluster requires 4 to commit; same fault tolerance as 5.

Distributed Locking

The naive Redis lock (SET lock value NX EX 30) famously fails under network partition: a client believes it holds the lock; the network blips; the lock TTL expires; another client acquires the lock; the original client wakes up and acts as if it still holds the lock; you have two writers.

The robust pattern: fencing tokens. The lock service issues a monotonically increasing token with each acquisition. The lock holder includes the token in every operation; the resource (database, file system) rejects operations from older tokens. Even if two clients believe they hold the lock, only the higher-token operation succeeds. etcd, ZooKeeper, and Consul all support this pattern; Redis does not natively.

etcd, ZooKeeper, Consul - The Production Trio

etcd: Raft-based KV store; the substrate of Kubernetes; used by Vault HA, Rook. Optimised for correctness and Kubernetes integration.
ZooKeeper: ZAB-based (similar to Raft) coordination service; older, mature, used by HBase, Kafka (legacy), Solr.
Consul: Raft-based service registry + KV + health checks + service mesh; HashiCorp's integrated approach.

For new infrastructure, etcd is the default. For ZooKeeper-backed legacy systems (HBase, older Kafka), running ZooKeeper is the path of least resistance. Consul shines when you want the service-registry features alongside the coordination primitives.

Operational Hazards

Stuck quorum: 3-node etcd loses 2 nodes; remaining node cannot make progress. Recovery: single-node restoration from snapshot, then re-add members. Practice this drill.
Slow disk: Raft commit requires fsync; a slow disk slows every write across the cluster. Use SSDs; alert on fsync latency.
Network partition: minority side cannot elect a leader and refuses writes; majority side keeps working. Verify clients fail-fast on the minority side rather than hanging.
Cluster too large: more than 7 voters and commit latency suffers; consider learner nodes for scale-out.

Distributed Lock with Fencing Token

Self-Check Quiz

You have a 4-node Raft cluster. How many failures can it tolerate? (Answer: still only 1. 2N+1 with N=1 (so 3 nodes) and N=2 (so 5 nodes); 4 nodes need 3 to commit, same as 3-node cluster but with more network traffic. Always odd.)
Why does the naive Redis SETNX lock fail under network partition? (Answer: TTL expires while client thinks it holds the lock; another client acquires; original client wakes up and acts as if it still has the lock. Need fencing tokens.)
etcd commit latency suddenly tripled. What is the most likely cause? (Answer: slow disk fsync. Raft commit requires fsync on the leader and majority followers. SSDs and disk-latency monitoring matter.)
Your 3-node etcd cluster lost 2 nodes. What is the recovery path? (Answer: do NOT add nodes to a stuck cluster. Snapshot from the surviving node, single-node restoration, then add members one at a time. Test this drill quarterly.)

For deeper coverage of how SPIRE Server uses Raft for HA, see the SPIRE Architecture & Components module. The SPIFFE/SPIRE cheatsheet is the fast operational reference once you start running etcd-backed identity systems. The underlying identity primitive (SPIFFE) is the standard the rest of the modern stack converges on.

Real world

Where this shows up

etcd backs every Kubernetes cluster ever deployed; Raft is in your production stack whether you knew it or not.
CockroachDB runs thousands of Raft groups per cluster, one per data range.
HashiCorp Vault HA storage uses Raft for replicated secret state.
Consul uses Raft for the catalogue and KV store; Serf for the gossip layer.

Production notes

Keep these close

Always use odd-number Raft clusters: 3 for most workloads, 5 for high availability. Never even.
Practice etcd snapshot recovery quarterly. The runbook for recovering a stuck quorum is the difference between minutes and hours of cluster downtime.
Run etcd on dedicated SSDs with low fsync latency. Slow disks slow every write across the cluster.

Common mistakes

What usually breaks

Inventing your own “HA” with naive locks (Redis SETNX) instead of using consensus primitives. Always fails under partition.
Adding a node to a stuck Raft cluster instead of restoring from snapshot. New nodes need a quorum to join.
Running consensus on shared infrastructure (etcd on the same disk as your database). Slow neighbour = stuck quorum.

Security risks

Threats to watch

etcd unencrypted at rest exposes every Kubernetes secret in the snapshot. Always enable KMS-backed encryption.
etcd peer / client TLS is not on by default in some installers; verify before going to production.
A compromised etcd member can poison cluster state. Audit etcd member additions and rotate certs regularly.
Distributed locks without fencing tokens silently allow stale lock holders to corrupt resources.

Tradeoffs

Design choices you should be able to defend

Raft

Pros

Designed for understandability
Stable, well-implemented in many libraries
Standard for new systems

Cons

Newer than Paxos
Performance very slightly behind multi-Paxos in some workloads

Paxos / Multi-Paxos

Pros

Mathematically influential
Battle-tested in Google Spanner / Chubby

Cons

Notoriously hard to implement correctly
Many subtle production bugs

3-node Raft cluster

Pros

Tolerates 1 failure
Lowest commit latency
Cheapest infra

Cons

Loss of 2 nodes = stuck quorum

5-node Raft cluster

Pros

Tolerates 2 failures
Survives multi-AZ outages

Cons

Higher commit latency
More infra cost

Alternatives

Other production approaches

etcd

Raft-based KV; substrate of Kubernetes; default for new infra.

ZooKeeper

ZAB-based; mature, used by HBase / older Kafka / Solr.

Consul

Raft + service registry + KV + health checks; HashiCorp's integrated approach.

Apache Bookkeeper

Used as the backing store for Pulsar; consensus + ledger storage.

Hazelcast / Apache Ignite

In-memory data grids with consensus-backed coordination; fits when latency < 1ms required.

Think like an engineer

Questions to answer before shipping

Before you reach for a distributed lock, ask: do I need lock semantics, or do I need leader-elected singleton execution? They are different problems with different primitives.
When you see “HA via two replicas” in a design doc, ask: what happens during a partition? If the answer is fuzzy, you do not have HA - you have two SPOFs.

Key terms

Vocabulary used in this module

Consensus

The problem of getting a group of distributed nodes to agree on a single value, even with failures.

Raft

A consensus algorithm designed to be understandable; used by etcd, Consul, CockroachDB.

Quorum

Majority of nodes; required for any progress in Raft / Paxos.

Fencing token

Monotonic token issued by a lock service; prevents stale lock holders from corrupting state.

Leader election

Process by which a single node is chosen to coordinate; foundational to Raft and many distributed systems.

Labs

Hands-on labs

90 minutesIntermediate

Lab 5.1 - etcd Cluster Bootstrap and Failover

Run a 3-node etcd cluster, write data, kill one node, verify availability; kill two, observe stuck quorum.

Bootstrap 3-node etcd via docker-compose
Write keys, observe replication
Kill one node; verify writes still succeed
Kill two nodes; verify writes block
Restore from snapshot; re-add members

View lab on GitHub

90 minutesAdvanced

Lab 5.2 - Distributed Lock with Fencing Token

Implement a correct distributed lock using etcd Lease + fencing token; demonstrate why naive locks fail.

Implement naive Redis SETNX lock; reproduce partition failure mode
Implement etcd-based lock with fencing token
Resource (Postgres) rejects operations from old tokens
Demonstrate two-client race; only one operation succeeds

View lab on GitHub

60 minutesIntermediate

Lab 5.3 - Leader Election in Application Code

Build a singleton-task pattern: many replicas, only one runs the periodic job at a time.

Use Kubernetes Lease object as election primitive
Run 3 replicas; only the leader executes the cron logic
Kill the leader; verify another replica takes over within seconds

View lab on GitHub

Recap

Key takeaways

Consensus is the substrate of every CP system; Kubernetes, Vault, and most modern infra rely on Raft etcd
Use Raft for new systems; Paxos is the older, harder alternative
Cluster size: 2N+1 tolerates N failures; always odd; 3 or 5 for nearly all real workloads
Distributed locking requires consensus + fencing tokens, not naive SETNX
A stuck quorum is recoverable but only if you have a tested runbook; never improvise

Related resources

Consensus & Coordination

Learning objectives

What Consensus Is

Raft - Consensus for Humans

Paxos and Why You Probably Don't Implement It

Cluster Sizing

Distributed Locking

etcd, ZooKeeper, Consul - The Production Trio

Operational Hazards

Distributed Lock with Fencing Token

Self-Check Quiz

Where this shows up

Keep these close

What usually breaks

Threats to watch

Design choices you should be able to defend

Raft

Paxos / Multi-Paxos

3-node Raft cluster

5-node Raft cluster

Other production approaches

etcd

ZooKeeper

Consul

Apache Bookkeeper

Hazelcast / Apache Ignite

Questions to answer before shipping

Vocabulary used in this module

Consensus

Raft

Quorum

Fencing token

Leader election

Hands-on labs

Lab 5.1 - etcd Cluster Bootstrap and Failover

Lab 5.2 - Distributed Lock with Fencing Token

Lab 5.3 - Leader Election in Application Code

Key takeaways

Keep learning across CodersSecret

Related guides

Cheatsheets

Interactive labs

Glossary terms