Module 4: Distributed Data Management Slides

Slide walkthrough for Module 4 of Distributed Systems Engineering: Building Scalable, Reliable & Secure Systems: How modern systems split, replicate, and...

This slide page is the visual review companion for the full course module. Use it to recap the architecture, examples, exercises, production warnings, and takeaways after reading the lesson.

Slide Outline

  1. Distributed Data Management - How modern systems split, replicate, and reconcile data across many machines — replication, sharding, quorums, consistency models, and the distributed databases that implement them.
  2. Learning Objectives - 5 outcomes for this module
  3. Why This Module Matters - Distributed data is the hardest part of distributed systems - once data is split across machines, every read and write h
  4. Before vs After - The operational shift this module teaches
  5. Replication Strategies - Lesson section from the full module
  6. Sharding Strategies - Lesson section from the full module
  7. Quorum Math - Lesson section from the full module
  8. Consistency Models in Practice - Lesson section from the full module
  9. Distributed Database Choices - Lesson section from the full module
  10. Common Pitfalls - Lesson section from the full module
  11. Quorum Write Flow - Lesson section from the full module
  12. Self-Check Quiz - Lesson section from the full module
  13. Real-World Use Cases - Cassandra at Netflix replicates user data across multiple regions with LOCAL_QUORUM for low-latency reads; DynamoDB Global Tables provide multi-region active-active with last-writer-wins by default.
  14. Common Mistakes to Avoid - 3 mistakes covered
  15. Production Notes - 3 practical notes
  16. Security Risks to Watch - 4 risks covered
  17. Hands-On Labs - 3 hands-on labs
  18. Key Takeaways - 5 points to remember

Learning Objectives

  • Pick between hash and range partitioning based on access patterns
  • Design replication strategies (single-leader, multi-leader, leaderless) and their failover behaviour
  • Apply quorum math (W + R > N) to choose consistency levels
  • Read a Cassandra / DynamoDB / PostgreSQL replication topology and predict its failure modes
  • Avoid the classic distributed-data anti-patterns: hot partitions, replication lag, write conflicts
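
To make the first objective concrete, here is a minimal Python sketch of the two partitioning schemes. The function names, the MD5-based digest, and the split points are illustrative assumptions, not taken from the course material. Hash partitioning spreads keys evenly but scatters adjacent keys, so range scans touch every shard; range partitioning keeps adjacent keys together, which helps scans but can concentrate load on one shard.

```python
import bisect
import hashlib


def hash_shard(key: str, num_shards: int) -> int:
    """Hash partitioning: a stable digest of the key, reduced modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(key: str, split_points: list[str]) -> int:
    """Range partitioning: sorted split points map each key to the range it falls in."""
    return bisect.bisect_right(split_points, key)


# Hypothetical split points: shard 0 holds keys < "h", shard 1 holds "h" up to
# (not including) "p", shard 2 holds everything from "p" onwards.
SPLIT_POINTS = ["h", "p"]

if __name__ == "__main__":
    for user in ["alice", "harry", "zoe"]:
        print(user, hash_shard(user, 3), range_shard(user, SPLIT_POINTS))
```

Plain modulo hashing remaps most keys when the shard count changes, which is one reason production systems tend to use consistent hashing or fixed slot tables instead.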

Why This Module Matters

Distributed data is the hardest part of distributed systems — once data is split across machines, every read and write has to navigate replication lag, partition imbalance, and consistency trade-offs. Engineers who internalise the W+R>N rule, hot-partition mitigations, and the difference between sync and async replication design data layers that hold up. Engineers who skip the foundations end up reinventing distributed databases badly and debugging the same outages for years.
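
The W + R > N rule mentioned above can be checked with a few lines of Python. This is a sketch; the function names are hypothetical. The rule works because a write acknowledged by W replicas and a read that contacts R replicas must share at least W + R - N replicas, so when that number is positive, every read reaches at least one replica holding the latest acknowledged write.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Return True when any write quorum and any read quorum share a replica."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    return w + r > n


def min_overlap(n: int, w: int, r: int) -> int:
    """Smallest number of replicas guaranteed to be in both quorums."""
    return max(0, w + r - n)


if __name__ == "__main__":
    print(quorums_overlap(3, 2, 2), min_overlap(3, 2, 2))  # N=3 with W=2, R=2
    print(quorums_overlap(3, 1, 1), min_overlap(3, 1, 1))  # N=3 with W=1, R=1
```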

Production Notes

  • Replication lag is a first-class metric to alert on. Past a threshold, route reads back to the leader rather than serve stale data.
  • Hot partitions are the #1 distributed-data scalability bug. Detect via per-shard QPS dashboards; mitigate via key salting or tenant-aware shard routing.
  • Cross-shard transactions are expensive (2PC, distributed locking). Design data models to keep related data co-located in the same shard.
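
The first note, falling back to the leader once lag passes a threshold, might look like the following sketch. The `Replica` type, the `lag_seconds` field, and the `"leader"` target name are assumptions for illustration; a real system would read lag from its database's own replication metrics.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    lag_seconds: float  # assumed to come from the database's replication metrics


def choose_read_target(replicas: list[Replica], max_lag_seconds: float,
                       leader: str = "leader") -> str:
    """Pick the least-lagged replica within the threshold, else fall back to the leader."""
    fresh = [r for r in replicas if r.lag_seconds <= max_lag_seconds]
    if not fresh:
        return leader
    return min(fresh, key=lambda r: r.lag_seconds).name


if __name__ == "__main__":
    replicas = [Replica("replica-a", 0.4), Replica("replica-b", 7.5)]
    print(choose_read_target(replicas, max_lag_seconds=1.0))
    print(choose_read_target(replicas, max_lag_seconds=0.1))
```

Sending every read to the leader during a lag spike shifts load onto it, so the threshold should be chosen with the leader's spare capacity in mind.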

Common Mistakes

  • Choosing eventual consistency without thinking through the read-your-writes anomaly.
  • Running single-region Cassandra at consistency level QUORUM, which works fine until you go multi-region and every QUORUM request has to wait on replicas in another region.
  • Treating replicas as a read-scale solution while ignoring that replication lag makes their reads stale.
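
The cross-region quorum cost in the second mistake follows from how Cassandra sizes its quorums: QUORUM is a majority of the summed replication factors across all datacenters, while LOCAL_QUORUM is a majority of the local datacenter's replication factor only. The sketch below works through that arithmetic; the datacenter names and replication factors are hypothetical.

```python
def quorum_size(replication_factors: dict[str, int]) -> int:
    """QUORUM: majority of the replicas summed over all datacenters."""
    return sum(replication_factors.values()) // 2 + 1


def local_quorum_size(replication_factors: dict[str, int], local_dc: str) -> int:
    """LOCAL_QUORUM: majority of the replicas in the coordinator's datacenter."""
    return replication_factors[local_dc] // 2 + 1


def must_cross_regions(replication_factors: dict[str, int], local_dc: str) -> bool:
    """True when QUORUM needs more replicas than the local datacenter holds."""
    return quorum_size(replication_factors) > replication_factors[local_dc]


if __name__ == "__main__":
    single = {"us-east": 3}                 # hypothetical single-region keyspace
    multi = {"us-east": 3, "eu-west": 3}    # hypothetical two-region keyspace
    print(quorum_size(single), must_cross_regions(single, "us-east"))
    print(quorum_size(multi), must_cross_regions(multi, "us-east"))
    print(local_quorum_size(multi, "us-east"))
```

With three replicas in each of two regions, QUORUM needs four acknowledgements, so at least one must come from the remote region on every request, while LOCAL_QUORUM needs only two local ones.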

Key Takeaways

  • Replication is for durability and availability; sharding is for scale; large systems need both
  • Quorum math (W + R > N) makes every read quorum overlap the latest write quorum; it is the baseline rule for strong consistency in leaderless systems
  • Hot partitions are the #1 distributed-data scalability bug; design key spaces deliberately
  • Cross-shard transactions are expensive; data models should make them rare
  • Replication lag is a first-class metric to alert on, not an implementation detail

Hands-On Labs

  1. Lab 4.1 — Postgres Streaming Replication + Failover

    Set up Postgres primary + replica, force failover, observe data loss window with sync vs async replication.

    60 minutes - Intermediate

    • Spin up primary + replica with docker-compose
    • Run async replication; cause primary crash mid-write; measure data loss
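    • Switch to synchronous replication: list the replica in synchronous_standby_names and keep synchronous_commit=on (on its own, synchronous_commit=on only waits for the local WAL flush); rerun; verify no acknowledged write is lost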

  2. Lab 4.2 — Cassandra Quorum Behaviour

    Run a 3-node Cassandra cluster, vary consistency levels, observe behaviour during node failure.

    90 minutes - Intermediate

    • Spin up Cassandra cluster (3 nodes)
    • Write with CL=QUORUM; verify reads see latest
    • Kill one node; verify QUORUM still works
    • Kill two nodes; verify QUORUM requests now fail while CL=ONE still succeeds (eventually consistent), assuming a replication factor of 3

  3. Lab 4.3 — Hot Partition Reproduction

    Cause and mitigate a hot partition in Redis Cluster.

    45 minutes - Intermediate

    • Send 90% of traffic to one key
    • Observe per-node QPS; identify the hot node
    • Apply key salting (key:0..key:9); redistribute
    • Confirm load balances
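
The key-salting step in Lab 4.3 can be sketched in Python as follows. Redis Cluster assigns a key without a hash tag to slot CRC16(key) mod 16384, and `binascii.crc_hqx` with an initial value of 0 computes the same CRC16 variant (XMODEM). The helper names are hypothetical, and the sketch ignores `{hash tag}` handling, which the lab's `key:0..key:9` names do not use.

```python
import binascii
import random

NUM_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def key_slot(key: str) -> int:
    """Hash slot for a key that contains no {hash tag}."""
    return binascii.crc_hqx(key.encode("utf-8"), 0) % NUM_SLOTS


def all_salted_keys(base: str, num_salts: int = 10) -> list[str]:
    """Every salted variant of a hot key, e.g. key:0 .. key:9."""
    return [f"{base}:{i}" for i in range(num_salts)]


def salted_key(base: str, num_salts: int = 10) -> str:
    """Pick one salted variant at random so traffic spreads across the variants."""
    return f"{base}:{random.randrange(num_salts)}"


if __name__ == "__main__":
    for k in all_salted_keys("key"):
        print(k, key_slot(k))
```

Different slots do not guarantee different nodes, since each node owns a range of slots, so the per-node QPS check in the last lab step is still needed. Readers of a salted key also have to know which variant to read, or combine all of them.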
