How modern systems split, replicate, and reconcile data across many machines — replication, sharding, quorums, consistency models, and the distributed databases that implement them.
-Pick between hash and range partitioning based on access patterns
-Design replication strategies (single-leader, multi-leader, leaderless) and their failover behaviour
-Apply quorum math (W + R > N) to choose consistency levels
-Read a Cassandra / DynamoDB / PostgreSQL replication topology and predict its failure modes
-Avoid the classic distributed-data anti-patterns: hot partitions, replication lag, write conflicts
Before
-Single Postgres primary; vertical scaling until the box maxes out
-Hot keys cause unexplained latency; team scales the cluster, no improvement
-Multi-region by accident; cross-region writes drag p99 beyond the SLO
-Stale reads from replicas confuse users; no consistency model documented
After
+Sharded data layer chosen for the access pattern; horizontal scale as a normal mode
+Per-shard QPS dashboards; hot keys detected before users are affected
+Multi-region with LOCAL_QUORUM (Cassandra) or sharded ownership (CockroachDB)
+Consistency level per query; engineers make the choice with eyes open
One machine cannot hold all your data and one machine cannot survive forever. Distributed data management is the discipline of splitting state across many machines (sharding) and keeping multiple copies of each piece (replication) so that the system stays available, durable, and fast enough — while exposing a coherent enough story to applications that they can be written without thinking about every node.
Replication Strategies
Three classic models, each a different trade-off:
Single-leader (PostgreSQL streaming, MySQL, MongoDB primary): all writes go through one leader. Followers replicate the leader's log. Simple to reason about, but the leader is a bottleneck and a SPOF (mitigated by automatic failover).
Multi-leader (multi-region MySQL with bidirectional replication, CRDT-backed systems): writes accepted at multiple nodes; conflicts resolved by merge rules. Harder; useful for geo-distributed systems where local writes matter.
Leaderless (Cassandra, DynamoDB, Riak): writes are sent to multiple replicas in parallel; reads are reconciled via quorum. No single “leader” per partition.
Synchronous vs asynchronous replication is orthogonal:
Synchronous: write is acknowledged after at least one replica confirms. Higher latency, no data loss on leader crash.
Asynchronous: write is acknowledged on leader commit; replicas catch up later. Lower latency; potential data loss window.
Semi-synchronous: hybrid; ack after any one replica confirms.
Sharding Strategies
Hash partitioning: partition = hash(key) % N. Excellent load distribution; range queries are expensive (every shard touched). Used by Cassandra, DynamoDB, Redis Cluster.
Range partitioning: each shard owns a contiguous key range. Range queries are efficient; hot spots are easy to create (a key prefix that is heavily written becomes a single-shard bottleneck). Used by HBase, BigTable, CockroachDB, MongoDB sharded clusters.
Consistent hashing: solves the “adding a node forces all keys to move” problem of naive hash partitioning. Each node is hashed onto a ring; each key belongs to the next node clockwise. Adding a node only moves 1/N of keys. Combined with virtual nodes (256 vnodes per physical node in Cassandra) for smoother rebalancing.
With N replicas, write quorum W, and read quorum R, you achieve strong consistency when W + R > N. The intuition: any read overlaps any write by at least one node, so the latest write is visible.
N=3, W=2, R=2: tolerates 1 failure for both reads and writes; strong consistency. Standard Cassandra production.
N=3, W=3, R=1: every write hits all replicas; reads are fast and consistent; one failure blocks writes. Read-heavy workloads.
N=3, W=1, R=1: maximum availability and lowest latency; eventual consistency only. DynamoDB default.
Cassandra exposes this as consistency levels (ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL). For multi-region deployments LOCAL_QUORUM is critical — it requires a quorum within the local datacentre but does not wait for cross-region acks.
Consistency Models in Practice
The same hierarchy you saw in Module 1 (Linearizable → Sequential → Causal → Read-your-writes → Eventual) shows up in every real database, often as tunable guarantees:
Spanner / CockroachDB: linearizable by default, with explicit follower-read modes.
Distributed Database Choices
Postgres / MySQL: single-leader replication; horizontal scale via read replicas or external sharding (Citus, Vitess). Strong consistency on the leader; read lag on replicas.
Cassandra: leaderless, hash-partitioned, AP-leaning. High write throughput, tunable consistency. Great for write-heavy time-series; less great for low-latency reads of small data.
DynamoDB: managed, hash-partitioned, AP-leaning. Eventual or strong consistency per call. Excellent if you stay within its access patterns; expensive if you fight them.
CockroachDB: distributed SQL with per-range Raft consensus. Linearizable by default; horizontal scale; SQL surface. The heaviest of the modern options.
Spanner / TiDB: globally-distributed SQL with linearizable transactions backed by TrueTime / TSO. Premium pricing for premium consistency.
Redis Cluster: in-memory, hash-slot partitioned, asynchronous replication. Great cache or session store; not your primary database.
Common Pitfalls
Read-your-writes anomaly: writing to leader, reading from a replica, getting your own old value. Fix: stickify reads to leader for a short window after a write.
Hot partition: a partition key with skewed traffic (one tenant, one celebrity user) overloads one node. Fix: salted keys, increased shard count, multi-tenant isolation.
Cross-shard transactions: expensive (2PC, distributed locking). Design data models to keep related data in the same shard wherever possible.
Replication lag spikes: replicas fall behind during bulk writes; reads served from lagging replicas return stale data. Monitor lag in seconds or bytes; fail reads to leader past a threshold.
Quorum Write Flow
Self-Check Quiz
N=5 Cassandra cluster. You write with CL=QUORUM, then immediately read with CL=ONE. Can you see your write? (Answer: not guaranteed. CL=ONE returns the first replica response, which may be a stale one. For strong consistency, you need W+R>N — e.g. CL=QUORUM on both.)
Your single celebrity user generates 90% of writes for one shard. What is happening, and what is the fix? (Answer: hot partition. Fix: split the key with salting (user_id:0..N) and aggregate at read time, or scale out and use a different partitioning key.)
Why does range-partitioned MongoDB sometimes accidentally create hot shards while hash-partitioned Cassandra rarely does? (Answer: range partitioning concentrates time-prefix or sequential keys on one shard; hash partitioning randomises. Choose the partition strategy based on access pattern.)
You see replication lag of 30 seconds on a Postgres read replica. What three actions matter most? (Answer: alert if it exceeds threshold; route latency-sensitive reads to leader; investigate root cause — bulk writes, slow disk, or vacuum.)
For the algorithm-level treatment of consistent hashing, vector clocks, CRDTs, and quorum proofs, read the Distributed Systems Algorithms guide. For Kubernetes-specific operational patterns the Kubernetes cheatsheet is the fast reference.
Real world
Where this shows up
-Cassandra at Netflix replicates user data across multiple regions with LOCAL_QUORUM for low-latency reads.
-DynamoDB Global Tables provide multi-region active-active with last-writer-wins by default.
-CockroachDB uses per-range Raft groups to provide globally-consistent SQL with horizontal scale.
-Discord migrated from MongoDB to Cassandra (then to ScyllaDB) for their messages workload due to Cassandra's leaderless write throughput.
Production notes
Keep these close
!Replication lag is a first-class metric to alert on. Past a threshold, fail reads back to the leader rather than serve stale data.
!Hot partitions are the #1 distributed-data scalability bug. Detect via per-shard QPS dashboards; mitigate via key salting or tenant-aware shard routing.
!Cross-shard transactions are expensive (2PC, distributed locking). Design data models to keep related data co-located in the same shard.
Common mistakes
What usually breaks
!Choosing eventual consistency without thinking through the read-your-writes anomaly.
!Single-region Cassandra with consistency level QUORUM — works fine until you go multi-region and discover the cross-region quorum cost.
!Treating replicas as a read-scale solution when replication lag means stale reads.
Security risks
Threats to watch
!Multi-tenant sharded systems leak data when the application forgets to scope queries by tenant_id; design tenant scoping into the storage layer, not the app.
!Read replicas with different RBAC than the primary expose data the security team thought was protected.
!Backups of distributed databases multiply the data-leak surface; encrypt at rest with KMS, audit access.
!Cross-region replication over public internet must use TLS; a leaked snapshot in transit is a leaked database.
Tradeoffs
Design choices you should be able to defend
Single-leader (Postgres, MySQL)
Pros
+Strong consistency on the leader
+Simple to reason about
+Atomic transactions
Cons
-Leader is a write bottleneck
-Failover is the SPOF reconciliation problem
-Read replicas have lag
Leaderless (Cassandra, DynamoDB)
Pros
+No single bottleneck
+High write throughput
+Tunable consistency
Cons
-Complex consistency story
-No cross-key transactions
-Operational complexity
Multi-leader (CRDT-backed, multi-region MySQL)
Pros
+Local writes in every region
+No coordination latency
Cons
-Conflict resolution required
-Hard to reason about
Alternatives
Other production approaches
Sharded Postgres (Citus, Vitess)
Postgres with horizontal sharding; keep SQL surface, add scale.
Apache Cassandra / ScyllaDB
Leaderless, hash-partitioned, AP-leaning; best for high write throughput.
CockroachDB / TiDB / YugabyteDB
Distributed SQL with per-range consensus; SQL surface + horizontal scale.
Spanner / AlloyDB
Globally-consistent SQL; premium pricing for premium consistency.
DynamoDB
Managed, hash-partitioned; great if you stay within its access patterns, painful if you fight them.
Think like an engineer
Questions to answer before shipping
?Choose partition keys based on the access pattern, not the storage convenience. Keys that are easy to write but hard to query become technical debt.
?For multi-tenant systems, design for the worst tenant. The 90th-percentile customer is your bottleneck before it's your average.
?Consistency level is per query, not per database. Force engineers to make the choice deliberately for each operation.
Key terms
Vocabulary used in this module
Sharding
Splitting data across multiple machines by partitioning the key space.
Quorum
Minimum number of nodes that must respond for an operation to be considered successful.
Replication lag
How far behind the leader a replica is; the staleness window for reads from that replica.
Hot partition
A partition with disproportionately high traffic, overloading one node.
Eventual consistency
Replicas converge to the same state if writes stop; the weakest useful guarantee.