Skip to main content

Module 4: Distributed Data Management

How modern systems split, replicate, and reconcile data across many machines — replication, sharding, quorums, consistency models, and the distributed databases that implement them.

5 hours. 3 hands-on labs. Free course module.

Learning Objectives

  • Pick between hash and range partitioning based on access patterns
  • Design replication strategies (single-leader, multi-leader, leaderless) and their failover behaviour
  • Apply quorum math (W + R > N) to choose consistency levels
  • Read a Cassandra / DynamoDB / PostgreSQL replication topology and predict its failure modes
  • Avoid the classic distributed-data anti-patterns: hot partitions, replication lag, write conflicts

Why This Matters

Distributed data is the hardest part of distributed systems — once data is split across machines, every read and write has to navigate replication lag, partition imbalance, and consistency trade-offs. Engineers who internalise the W+R>N rule, hot-partition mitigations, and the difference between sync and async replication design data layers that hold up. Engineers who skip the foundations end up reinventing distributed databases badly and debugging the same outages for years.

SHARDING + REPLICATION TOPOLOGYShard A (keys 0..k1)leaderreplicareplicaShard B (keys k1..k2)leaderreplicareplicaShard C (keys k2..k3)leaderreplicareplicaShard D (keys k3..)leaderreplicareplicaSharding splits the keyspace; replication protects each shard.Per-shard quorum: W=2 R=2 N=3. Cross-shard transactions are expensive — design to avoid them.
Architecture diagram for Module 4: Distributed Data Management.

Lesson Content

One machine cannot hold all your data and one machine cannot survive forever. Distributed data management is the discipline of splitting state across many machines (sharding) and keeping multiple copies of each piece (replication) so that the system stays available, durable, and fast enough — while exposing a coherent enough story to applications that they can be written without thinking about every node.

Replication Strategies

Three classic models, each a different trade-off:

  • Single-leader (PostgreSQL streaming, MySQL, MongoDB primary): all writes go through one leader. Followers replicate the leader's log. Simple to reason about, but the leader is a bottleneck and a SPOF (mitigated by automatic failover).
  • Multi-leader (multi-region MySQL with bidirectional replication, CRDT-backed systems): writes accepted at multiple nodes; conflicts resolved by merge rules. Harder; useful for geo-distributed systems where local writes matter.
  • Leaderless (Cassandra, DynamoDB, Riak): writes are sent to multiple replicas in parallel; reads are reconciled via quorum. No single “leader” per partition.

Synchronous vs asynchronous replication is orthogonal:

  • Synchronous: write is acknowledged after at least one replica confirms. Higher latency, no data loss on leader crash.
  • Asynchronous: write is acknowledged on leader commit; replicas catch up later. Lower latency; potential data loss window.
  • Semi-synchronous: hybrid; ack after any one replica confirms.

Sharding Strategies

Hash partitioning: partition = hash(key) % N. Excellent load distribution; range queries are expensive (every shard touched). Used by Cassandra, DynamoDB, Redis Cluster.

Range partitioning: each shard owns a contiguous key range. Range queries are efficient; hot spots are easy to create (a key prefix that is heavily written becomes a single-shard bottleneck). Used by HBase, BigTable, CockroachDB, MongoDB sharded clusters.

Consistent hashing: solves the “adding a node forces all keys to move” problem of naive hash partitioning. Each node is hashed onto a ring; each key belongs to the next node clockwise. Adding a node only moves 1/N of keys. Combined with virtual nodes (256 vnodes per physical node in Cassandra) for smoother rebalancing.

The Distributed Systems Algorithms guide goes deeper on consistent hashing math.

Quorum Math

With N replicas, write quorum W, and read quorum R, you achieve strong consistency when W + R > N. The intuition: any read overlaps any write by at least one node, so the latest write is visible.

  • N=3, W=2, R=2: tolerates 1 failure for both reads and writes; strong consistency. Standard Cassandra production.
  • N=3, W=3, R=1: every write hits all replicas; reads are fast and consistent; one failure blocks writes. Read-heavy workloads.
  • N=3, W=1, R=1: maximum availability and lowest latency; eventual consistency only. DynamoDB default.
  • N=5, W=3, R=3: tolerates 2 simultaneous failures; strong consistency. Mission-critical Cassandra clusters.

Cassandra exposes this as consistency levels (ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL). For multi-region deployments LOCAL_QUORUM is critical — it requires a quorum within the local datacentre but does not wait for cross-region acks.

Consistency Models in Practice

The same hierarchy you saw in Module 1 (Linearizable → Sequential → Causal → Read-your-writes → Eventual) shows up in every real database, often as tunable guarantees:

  • Cassandra: per-query consistency level.
  • MongoDB: readConcern (local, majority, linearizable) + writeConcern.
  • DynamoDB: per-call ConsistentRead flag.
  • Spanner / CockroachDB: linearizable by default, with explicit follower-read modes.

Distributed Database Choices

  • Postgres / MySQL: single-leader replication; horizontal scale via read replicas or external sharding (Citus, Vitess). Strong consistency on the leader; read lag on replicas.
  • Cassandra: leaderless, hash-partitioned, AP-leaning. High write throughput, tunable consistency. Great for write-heavy time-series; less great for low-latency reads of small data.
  • DynamoDB: managed, hash-partitioned, AP-leaning. Eventual or strong consistency per call. Excellent if you stay within its access patterns; expensive if you fight them.
  • CockroachDB: distributed SQL with per-range Raft consensus. Linearizable by default; horizontal scale; SQL surface. The heaviest of the modern options.
  • Spanner / TiDB: globally-distributed SQL with linearizable transactions backed by TrueTime / TSO. Premium pricing for premium consistency.
  • Redis Cluster: in-memory, hash-slot partitioned, asynchronous replication. Great cache or session store; not your primary database.

Common Pitfalls

  • Read-your-writes anomaly: writing to leader, reading from a replica, getting your own old value. Fix: stickify reads to leader for a short window after a write.
  • Hot partition: a partition key with skewed traffic (one tenant, one celebrity user) overloads one node. Fix: salted keys, increased shard count, multi-tenant isolation.
  • Cross-shard transactions: expensive (2PC, distributed locking). Design data models to keep related data in the same shard wherever possible.
  • Replication lag spikes: replicas fall behind during bulk writes; reads served from lagging replicas return stale data. Monitor lag in seconds or bytes; fail reads to leader past a threshold.

Quorum Write Flow

QUORUM WRITE: N=3, W=2 (Cassandra-style)Coordinatorpicks any nodeR1ackR2ackR3slow / down2 acks ≥ W=2 ✓ commitHinted handoffrepairs R3 laterW+R>N (e.g. W=2, R=2 with N=3) gives strong consistency under any single failure.

Self-Check Quiz

  1. N=5 Cassandra cluster. You write with CL=QUORUM, then immediately read with CL=ONE. Can you see your write? (Answer: not guaranteed. CL=ONE returns the first replica response, which may be a stale one. For strong consistency, you need W+R>N — e.g. CL=QUORUM on both.)
  2. Your single celebrity user generates 90% of writes for one shard. What is happening, and what is the fix? (Answer: hot partition. Fix: split the key with salting (user_id:0..N) and aggregate at read time, or scale out and use a different partitioning key.)
  3. Why does range-partitioned MongoDB sometimes accidentally create hot shards while hash-partitioned Cassandra rarely does? (Answer: range partitioning concentrates time-prefix or sequential keys on one shard; hash partitioning randomises. Choose the partition strategy based on access pattern.)
  4. You see replication lag of 30 seconds on a Postgres read replica. What three actions matter most? (Answer: alert if it exceeds threshold; route latency-sensitive reads to leader; investigate root cause — bulk writes, slow disk, or vacuum.)

For the algorithm-level treatment of consistent hashing, vector clocks, CRDTs, and quorum proofs, read the Distributed Systems Algorithms guide. For Kubernetes-specific operational patterns the Kubernetes cheatsheet is the fast reference.

Real-World Use Cases

  • Cassandra at Netflix replicates user data across multiple regions with LOCAL_QUORUM for low-latency reads.
  • DynamoDB Global Tables provide multi-region active-active with last-writer-wins by default.
  • CockroachDB uses per-range Raft groups to provide globally-consistent SQL with horizontal scale.
  • Discord migrated from MongoDB to Cassandra (then to ScyllaDB) for their messages workload due to Cassandra's leaderless write throughput.

Production Notes

  • Replication lag is a first-class metric to alert on. Past a threshold, fail reads back to the leader rather than serve stale data.
  • Hot partitions are the #1 distributed-data scalability bug. Detect via per-shard QPS dashboards; mitigate via key salting or tenant-aware shard routing.
  • Cross-shard transactions are expensive (2PC, distributed locking). Design data models to keep related data co-located in the same shard.

Common Mistakes

  • Choosing eventual consistency without thinking through the read-your-writes anomaly.
  • Single-region Cassandra with consistency level QUORUM — works fine until you go multi-region and discover the cross-region quorum cost.
  • Treating replicas as a read-scale solution when replication lag means stale reads.

Security Risks to Watch

  • Multi-tenant sharded systems leak data when the application forgets to scope queries by tenant_id; design tenant scoping into the storage layer, not the app.
  • Read replicas with different RBAC than the primary expose data the security team thought was protected.
  • Backups of distributed databases multiply the data-leak surface; encrypt at rest with KMS, audit access.
  • Cross-region replication over public internet must use TLS; a leaked snapshot in transit is a leaked database.

Design Tradeoffs

Single-leader (Postgres, MySQL)

Pros

  • Strong consistency on the leader
  • Simple to reason about
  • Atomic transactions

Cons

  • Leader is a write bottleneck
  • Failover is the SPOF reconciliation problem
  • Read replicas have lag

Leaderless (Cassandra, DynamoDB)

Pros

  • No single bottleneck
  • High write throughput
  • Tunable consistency

Cons

  • Complex consistency story
  • No cross-key transactions
  • Operational complexity

Multi-leader (CRDT-backed, multi-region MySQL)

Pros

  • Local writes in every region
  • No coordination latency

Cons

  • Conflict resolution required
  • Hard to reason about

Production Alternatives

  • Sharded Postgres (Citus, Vitess): Postgres with horizontal sharding; keep SQL surface, add scale.
  • Apache Cassandra / ScyllaDB: Leaderless, hash-partitioned, AP-leaning; best for high write throughput.
  • CockroachDB / TiDB / YugabyteDB: Distributed SQL with per-range consensus; SQL surface + horizontal scale.
  • Spanner / AlloyDB: Globally-consistent SQL; premium pricing for premium consistency.
  • DynamoDB: Managed, hash-partitioned; great if you stay within its access patterns, painful if you fight them.

Think Like an Engineer

  • Choose partition keys based on the access pattern, not the storage convenience. Keys that are easy to write but hard to query become technical debt.
  • For multi-tenant systems, design for the worst tenant. The 90th-percentile customer is your bottleneck before it's your average.
  • Consistency level is per query, not per database. Force engineers to make the choice deliberately for each operation.

Production Story

A SaaS platform launched a feature where one enterprise customer suddenly generated 90% of writes to a single shard. Cassandra p99 went from 10ms to 600ms within a day. The team scaled out vertically, then horizontally, neither helped — only one node was hot. The fix was salting the customer's key (customer_id:0..15) and aggregating reads across the salted partitions. p99 dropped back below 20ms. The lesson: skewed traffic is the rule, not the exception, for any multi-tenant system.

Key Terms

Sharding
Splitting data across multiple machines by partitioning the key space.
Quorum
Minimum number of nodes that must respond for an operation to be considered successful.
Replication lag
How far behind the leader a replica is; the staleness window for reads from that replica.
Hot partition
A partition with disproportionately high traffic, overloading one node.
Eventual consistency
Replicas converge to the same state if writes stop; the weakest useful guarantee.

Hands-On Labs

  1. Lab 4.1 — Postgres Streaming Replication + Failover

    Set up Postgres primary + replica, force failover, observe data loss window with sync vs async replication.

    60 minutes - Intermediate

    • Spin up primary + replica with docker-compose
    • Run async replication; cause primary crash mid-write; measure data loss
    • Switch to synchronous_commit=on; rerun; verify zero data loss

    View lab files on GitHub

  2. Lab 4.2 — Cassandra Quorum Behaviour

    Run a 3-node Cassandra cluster, vary consistency levels, observe behaviour during node failure.

    90 minutes - Intermediate

    • Spin up Cassandra cluster (3 nodes)
    • Write with CL=QUORUM; verify reads see latest
    • Kill one node; verify QUORUM still works
    • Kill two nodes; verify QUORUM fails; ONE still works (eventual)

    View lab files on GitHub

  3. Lab 4.3 — Hot Partition Reproduction

    Cause and mitigate a hot partition in Redis Cluster.

    45 minutes - Intermediate

    • Send 90% of traffic to one key
    • Observe per-node QPS; identify the hot node
    • Apply key salting (key:0..key:9); redistribute
    • Confirm load balances

    View lab files on GitHub

Key Takeaways

  • Replication is for durability and availability; sharding is for scale — you need both at scale
  • Quorum math (W + R > N) is the rule for strong consistency in leaderless systems
  • Hot partitions are the #1 distributed-data scalability bug; design key spaces deliberately
  • Cross-shard transactions are expensive; data models should make them rare
  • Replication lag is a first-class metric to alert on, not an implementation detail