Module 4 of 12

Distributed Data Management

How modern systems split, replicate, and reconcile data across many machines — replication, sharding, quorums, consistency models, and the distributed databases that implement them.

5 hours3 labsFree

Start here

Learning objectives

  • Pick between hash and range partitioning based on access patterns
  • Design replication strategies (single-leader, multi-leader, leaderless) and their failover behaviour
  • Apply quorum math (W + R > N) to choose consistency levels
  • Read a Cassandra / DynamoDB / PostgreSQL replication topology and predict its failure modes
  • Avoid the classic distributed-data anti-patterns: hot partitions, replication lag, write conflicts

Before

  • Single Postgres primary; vertical scaling until the box maxes out
  • Hot keys cause unexplained latency; team scales the cluster, no improvement
  • Multi-region by accident; cross-region writes drag p99 beyond the SLO
  • Stale reads from replicas confuse users; no consistency model documented

After

  • Sharded data layer chosen for the access pattern; horizontal scale as a normal mode
  • Per-shard QPS dashboards; hot keys detected before users are affected
  • Multi-region with LOCAL_QUORUM (Cassandra) or sharded ownership (CockroachDB)
  • Consistency level per query; engineers make the choice with eyes open
SHARDING + REPLICATION TOPOLOGYShard A (keys 0..k1)leaderreplicareplicaShard B (keys k1..k2)leaderreplicareplicaShard C (keys k2..k3)leaderreplicareplicaShard D (keys k3..)leaderreplicareplicaSharding splits the keyspace; replication protects each shard.Per-shard quorum: W=2 R=2 N=3. Cross-shard transactions are expensive — design to avoid them.

One machine cannot hold all your data and one machine cannot survive forever. Distributed data management is the discipline of splitting state across many machines (sharding) and keeping multiple copies of each piece (replication) so that the system stays available, durable, and fast enough — while exposing a coherent enough story to applications that they can be written without thinking about every node.

Replication Strategies

Three classic models, each a different trade-off:

  • Single-leader (PostgreSQL streaming, MySQL, MongoDB primary): all writes go through one leader. Followers replicate the leader's log. Simple to reason about, but the leader is a bottleneck and a SPOF (mitigated by automatic failover).
  • Multi-leader (multi-region MySQL with bidirectional replication, CRDT-backed systems): writes accepted at multiple nodes; conflicts resolved by merge rules. Harder; useful for geo-distributed systems where local writes matter.
  • Leaderless (Cassandra, DynamoDB, Riak): writes are sent to multiple replicas in parallel; reads are reconciled via quorum. No single “leader” per partition.

Synchronous vs asynchronous replication is orthogonal:

  • Synchronous: write is acknowledged after at least one replica confirms. Higher latency, no data loss on leader crash.
  • Asynchronous: write is acknowledged on leader commit; replicas catch up later. Lower latency; potential data loss window.
  • Semi-synchronous: hybrid; ack after any one replica confirms.

Sharding Strategies

Hash partitioning: partition = hash(key) % N. Excellent load distribution; range queries are expensive (every shard touched). Used by Cassandra, DynamoDB, Redis Cluster.

Range partitioning: each shard owns a contiguous key range. Range queries are efficient; hot spots are easy to create (a key prefix that is heavily written becomes a single-shard bottleneck). Used by HBase, BigTable, CockroachDB, MongoDB sharded clusters.

Consistent hashing: solves the “adding a node forces all keys to move” problem of naive hash partitioning. Each node is hashed onto a ring; each key belongs to the next node clockwise. Adding a node only moves 1/N of keys. Combined with virtual nodes (256 vnodes per physical node in Cassandra) for smoother rebalancing.

The Distributed Systems Algorithms guide goes deeper on consistent hashing math.

Quorum Math

With N replicas, write quorum W, and read quorum R, you achieve strong consistency when W + R > N. The intuition: any read overlaps any write by at least one node, so the latest write is visible.

  • N=3, W=2, R=2: tolerates 1 failure for both reads and writes; strong consistency. Standard Cassandra production.
  • N=3, W=3, R=1: every write hits all replicas; reads are fast and consistent; one failure blocks writes. Read-heavy workloads.
  • N=3, W=1, R=1: maximum availability and lowest latency; eventual consistency only. DynamoDB default.
  • N=5, W=3, R=3: tolerates 2 simultaneous failures; strong consistency. Mission-critical Cassandra clusters.

Cassandra exposes this as consistency levels (ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL). For multi-region deployments LOCAL_QUORUM is critical — it requires a quorum within the local datacentre but does not wait for cross-region acks.

Consistency Models in Practice

The same hierarchy you saw in Module 1 (Linearizable → Sequential → Causal → Read-your-writes → Eventual) shows up in every real database, often as tunable guarantees:

  • Cassandra: per-query consistency level.
  • MongoDB: readConcern (local, majority, linearizable) + writeConcern.
  • DynamoDB: per-call ConsistentRead flag.
  • Spanner / CockroachDB: linearizable by default, with explicit follower-read modes.

Distributed Database Choices

  • Postgres / MySQL: single-leader replication; horizontal scale via read replicas or external sharding (Citus, Vitess). Strong consistency on the leader; read lag on replicas.
  • Cassandra: leaderless, hash-partitioned, AP-leaning. High write throughput, tunable consistency. Great for write-heavy time-series; less great for low-latency reads of small data.
  • DynamoDB: managed, hash-partitioned, AP-leaning. Eventual or strong consistency per call. Excellent if you stay within its access patterns; expensive if you fight them.
  • CockroachDB: distributed SQL with per-range Raft consensus. Linearizable by default; horizontal scale; SQL surface. The heaviest of the modern options.
  • Spanner / TiDB: globally-distributed SQL with linearizable transactions backed by TrueTime / TSO. Premium pricing for premium consistency.
  • Redis Cluster: in-memory, hash-slot partitioned, asynchronous replication. Great cache or session store; not your primary database.

Common Pitfalls

  • Read-your-writes anomaly: writing to leader, reading from a replica, getting your own old value. Fix: stickify reads to leader for a short window after a write.
  • Hot partition: a partition key with skewed traffic (one tenant, one celebrity user) overloads one node. Fix: salted keys, increased shard count, multi-tenant isolation.
  • Cross-shard transactions: expensive (2PC, distributed locking). Design data models to keep related data in the same shard wherever possible.
  • Replication lag spikes: replicas fall behind during bulk writes; reads served from lagging replicas return stale data. Monitor lag in seconds or bytes; fail reads to leader past a threshold.

Quorum Write Flow

QUORUM WRITE: N=3, W=2 (Cassandra-style)Coordinatorpicks any nodeR1ackR2ackR3slow / down2 acks ≥ W=2 ✓ commitHinted handoffrepairs R3 laterW+R>N (e.g. W=2, R=2 with N=3) gives strong consistency under any single failure.

Self-Check Quiz

  1. N=5 Cassandra cluster. You write with CL=QUORUM, then immediately read with CL=ONE. Can you see your write? (Answer: not guaranteed. CL=ONE returns the first replica response, which may be a stale one. For strong consistency, you need W+R>N — e.g. CL=QUORUM on both.)
  2. Your single celebrity user generates 90% of writes for one shard. What is happening, and what is the fix? (Answer: hot partition. Fix: split the key with salting (user_id:0..N) and aggregate at read time, or scale out and use a different partitioning key.)
  3. Why does range-partitioned MongoDB sometimes accidentally create hot shards while hash-partitioned Cassandra rarely does? (Answer: range partitioning concentrates time-prefix or sequential keys on one shard; hash partitioning randomises. Choose the partition strategy based on access pattern.)
  4. You see replication lag of 30 seconds on a Postgres read replica. What three actions matter most? (Answer: alert if it exceeds threshold; route latency-sensitive reads to leader; investigate root cause — bulk writes, slow disk, or vacuum.)

For the algorithm-level treatment of consistent hashing, vector clocks, CRDTs, and quorum proofs, read the Distributed Systems Algorithms guide. For Kubernetes-specific operational patterns the Kubernetes cheatsheet is the fast reference.

Real world

Where this shows up

  • Cassandra at Netflix replicates user data across multiple regions with LOCAL_QUORUM for low-latency reads.
  • DynamoDB Global Tables provide multi-region active-active with last-writer-wins by default.
  • CockroachDB uses per-range Raft groups to provide globally-consistent SQL with horizontal scale.
  • Discord migrated from MongoDB to Cassandra (then to ScyllaDB) for their messages workload due to Cassandra's leaderless write throughput.

Production notes

Keep these close

  • Replication lag is a first-class metric to alert on. Past a threshold, fail reads back to the leader rather than serve stale data.
  • Hot partitions are the #1 distributed-data scalability bug. Detect via per-shard QPS dashboards; mitigate via key salting or tenant-aware shard routing.
  • Cross-shard transactions are expensive (2PC, distributed locking). Design data models to keep related data co-located in the same shard.

Common mistakes

What usually breaks

  • Choosing eventual consistency without thinking through the read-your-writes anomaly.
  • Single-region Cassandra with consistency level QUORUM — works fine until you go multi-region and discover the cross-region quorum cost.
  • Treating replicas as a read-scale solution when replication lag means stale reads.

Security risks

Threats to watch

  • Multi-tenant sharded systems leak data when the application forgets to scope queries by tenant_id; design tenant scoping into the storage layer, not the app.
  • Read replicas with different RBAC than the primary expose data the security team thought was protected.
  • Backups of distributed databases multiply the data-leak surface; encrypt at rest with KMS, audit access.
  • Cross-region replication over public internet must use TLS; a leaked snapshot in transit is a leaked database.

Tradeoffs

Design choices you should be able to defend

Single-leader (Postgres, MySQL)

Pros

  • Strong consistency on the leader
  • Simple to reason about
  • Atomic transactions

Cons

  • Leader is a write bottleneck
  • Failover is the SPOF reconciliation problem
  • Read replicas have lag

Leaderless (Cassandra, DynamoDB)

Pros

  • No single bottleneck
  • High write throughput
  • Tunable consistency

Cons

  • Complex consistency story
  • No cross-key transactions
  • Operational complexity

Multi-leader (CRDT-backed, multi-region MySQL)

Pros

  • Local writes in every region
  • No coordination latency

Cons

  • Conflict resolution required
  • Hard to reason about

Alternatives

Other production approaches

Sharded Postgres (Citus, Vitess)

Postgres with horizontal sharding; keep SQL surface, add scale.

Apache Cassandra / ScyllaDB

Leaderless, hash-partitioned, AP-leaning; best for high write throughput.

CockroachDB / TiDB / YugabyteDB

Distributed SQL with per-range consensus; SQL surface + horizontal scale.

Spanner / AlloyDB

Globally-consistent SQL; premium pricing for premium consistency.

DynamoDB

Managed, hash-partitioned; great if you stay within its access patterns, painful if you fight them.

Think like an engineer

Questions to answer before shipping

  • Choose partition keys based on the access pattern, not the storage convenience. Keys that are easy to write but hard to query become technical debt.
  • For multi-tenant systems, design for the worst tenant. The 90th-percentile customer is your bottleneck before it's your average.
  • Consistency level is per query, not per database. Force engineers to make the choice deliberately for each operation.

Key terms

Vocabulary used in this module

Sharding

Splitting data across multiple machines by partitioning the key space.

Quorum

Minimum number of nodes that must respond for an operation to be considered successful.

Replication lag

How far behind the leader a replica is; the staleness window for reads from that replica.

Hot partition

A partition with disproportionately high traffic, overloading one node.

Eventual consistency

Replicas converge to the same state if writes stop; the weakest useful guarantee.

Labs

Hands-on labs

60 minutesIntermediate

Lab 4.1 — Postgres Streaming Replication + Failover

Set up Postgres primary + replica, force failover, observe data loss window with sync vs async replication.

  1. Spin up primary + replica with docker-compose
  2. Run async replication; cause primary crash mid-write; measure data loss
  3. Switch to synchronous_commit=on; rerun; verify zero data loss
View lab on GitHub
90 minutesIntermediate

Lab 4.2 — Cassandra Quorum Behaviour

Run a 3-node Cassandra cluster, vary consistency levels, observe behaviour during node failure.

  1. Spin up Cassandra cluster (3 nodes)
  2. Write with CL=QUORUM; verify reads see latest
  3. Kill one node; verify QUORUM still works
  4. Kill two nodes; verify QUORUM fails; ONE still works (eventual)
View lab on GitHub
45 minutesIntermediate

Lab 4.3 — Hot Partition Reproduction

Cause and mitigate a hot partition in Redis Cluster.

  1. Send 90% of traffic to one key
  2. Observe per-node QPS; identify the hot node
  3. Apply key salting (key:0..key:9); redistribute
  4. Confirm load balances
View lab on GitHub

Recap

Key takeaways

  • Replication is for durability and availability; sharding is for scale — you need both at scale
  • Quorum math (W + R > N) is the rule for strong consistency in leaderless systems
  • Hot partitions are the #1 distributed-data scalability bug; design key spaces deliberately
  • Cross-shard transactions are expensive; data models should make them rare
  • Replication lag is a first-class metric to alert on, not an implementation detail

Related resources

Keep learning across CodersSecret