
Module 2: Networking & Distributed Communication Slides

Slide walkthrough for Module 2 of Distributed Systems Engineering: Building Scalable, Reliable & Secure Systems: How services actually talk: TCP, HTTP/2,...

This slide page is the visual review companion for the full course module. Use it to recap the architecture, examples, exercises, production warnings, and takeaways after reading the lesson.

Slide Outline

  1. Networking & Distributed Communication - How services actually talk: TCP, HTTP/2, gRPC, service discovery, load balancing, retries, and the timeout discipline that keeps systems from melting.
  2. Learning Objectives - 5 outcomes for this module
  3. Why This Module Matters - The patterns in this module are the difference between a service that survives a bad day and one that cascades into a multi-team incident.
  4. Before vs After - The operational shift this module teaches
  5. TCP/IP - What Lives Underneath - Lesson section from the full module
  6. HTTP/1.1 vs HTTP/2 vs gRPC - What Each Buys You - Lesson section from the full module
  7. Service Discovery - How Services Find Each Other - Lesson section from the full module
  8. Load Balancing - Lesson section from the full module
  9. Retries, Timeouts, and the Storm - Lesson section from the full module
  10. Connection Pooling - Lesson section from the full module
  11. DNS as Distributed-Systems Risk - Lesson section from the full module
  12. TLS Handshake Sequence - Lesson section from the full module
  13. Real-World Use Cases - Google built gRPC as the open-source successor to Stubby, the internal RPC framework behind most of its service-to-service traffic; Cloudflare reduced internal latency by ~30% by moving from HTTP/1.1 to HTTP/2 with connection pooling.
  14. Common Mistakes to Avoid - 3 mistakes covered
  15. Production Notes - 3 practical notes
  16. Security Risks to Watch - 4 risks covered
  17. Hands-On Labs - 3 hands-on labs
  18. Key Takeaways - 5 points to remember

Learning Objectives

  • Read a TCP/IP packet flow and explain what each layer does in production
  • Compare HTTP/1.1, HTTP/2, and gRPC and pick the right one per workload
  • Implement service discovery without inventing a worse DNS
  • Design retry, timeout, and load-balancing policies that survive load
  • Diagnose and prevent retry storms before they cause outages

Why This Module Matters

The patterns in this module are the difference between a service that survives a bad day and one that cascades into a multi-team incident. Engineers who internalise timeouts, retries with budgets, and connection pooling can read an incident timeline and immediately see where the design failed. Engineers who skip them tend to debug the same outage repeatedly.

Production Notes

  • Set per-call timeouts at every layer. Many standard libraries default to “wait forever”, and that default is a common source of production stalls.
  • Run NodeLocal DNSCache on every Kubernetes cluster. The cost is one DaemonSet; the benefit is that lookups are answered by a cache on the same node instead of crossing the network to the cluster DNS service.
  • Treat retry policies as part of the service contract; document them and review at deployment.
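
One way to apply "timeouts at every layer" is to carry a single request deadline through the call chain, so each hop uses its own cap or whatever budget is left, whichever is smaller. The sketch below is a minimal illustration under that assumption; `Deadline`, `DeadlineExceeded`, and `timeout_for` are hypothetical names, not part of any library.

```python
import time
from typing import Callable


class DeadlineExceeded(Exception):
    """Raised when the overall request budget is already spent."""


class Deadline:
    """Absolute deadline shared by every hop of one request (hypothetical helper)."""

    def __init__(self, budget_s: float, clock: Callable[[], float] = time.monotonic):
        # A monotonic clock avoids surprises when wall-clock time is adjusted.
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        """Seconds left before the whole request should be abandoned."""
        return self._expires_at - self._clock()

    def timeout_for(self, per_call_cap_s: float) -> float:
        """Timeout for the next call: the layer's cap, or less if the budget is nearly gone."""
        left = self.remaining()
        if left <= 0:
            raise DeadlineExceeded("request budget exhausted before the call was made")
        return min(per_call_cap_s, left)
```

The returned value can be passed to any client that accepts a timeout in seconds, for example `socket.create_connection(addr, timeout=deadline.timeout_for(0.5))`. Failing fast once the budget is spent keeps a slow dependency from holding threads and connections in the layers above it.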

Common Mistakes

  • Setting infinite retries on a non-idempotent endpoint — one downstream blip becomes duplicate side effects everywhere.
  • Not setting per-call timeouts; the system inherits the default of “wait forever”.
  • Using HTTP/1.1 for high-throughput internal communication; without multiplexing, every in-flight request needs its own connection, so you pay for connections and handshakes you do not need.
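
The first mistake above can be guarded against in code by capping attempts and refusing to retry calls that are not safe to repeat. This is a minimal sketch, assuming the caller knows the HTTP method and whether the request carries an idempotency key; `TransientError` and `call_with_bounded_retries` are hypothetical names, and backoff is left out to keep the focus on the guard.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Methods that HTTP semantics (RFC 9110) define as idempotent.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def call_with_bounded_retries(
    method: str,
    send: Callable[[], T],
    max_attempts: int = 3,
    has_idempotency_key: bool = False,
) -> T:
    """Run `send`, retrying transient failures only when a repeat is safe."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    retry_safe = method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
    # A non-idempotent call without a key gets exactly one attempt.
    attempts_allowed = max_attempts if retry_safe else 1
    for attempt in range(1, attempts_allowed + 1):
        try:
            return send()
        except TransientError:
            if attempt == attempts_allowed:
                raise  # budget of attempts is spent; surface the failure
    raise AssertionError("unreachable")
```

With this guard, a `POST` without an idempotency key fails after one attempt instead of producing duplicate side effects, while a `GET` gets up to `max_attempts` tries.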

Key Takeaways

  • Most distributed-systems incidents are network incidents that look like application bugs
  • Use gRPC for service-to-service, HTTP/JSON for external; avoid HTTP/1.1 internally
  • Every call has a timeout. Every retry has exponential backoff with jitter. Every retry policy has a budget
  • Service discovery is mandatory infrastructure — pick DNS, registry, or mesh deliberately
  • DNS is a frequent cause of otherwise unexplained outages; treat it as a dependency that can fail
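
The third takeaway names two mechanisms that are easy to confuse: backoff with jitter spreads retries out in time, while a budget limits how many retries are sent at all. The sketch below illustrates both under stated assumptions: it uses the "full jitter" variant of backoff, and the budget is a simple token bucket counted in integer units. `backoff_delay` and `RetryBudget` are hypothetical names, and the default values are placeholders rather than recommendations.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base_s: float = 0.1,
    cap_s: float = 10.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full jitter: a uniform delay in [0, min(cap_s, base_s * 2**attempt)]."""
    r = rng if rng is not None else random.Random()
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return r.uniform(0.0, ceiling)


class RetryBudget:
    """Token bucket: each request earns a fraction of a retry, each retry spends one."""

    _COST = 100  # units spent per retry

    def __init__(self, retries_per_100_requests: int = 10, burst: int = 5):
        self._deposit = retries_per_100_requests  # units earned per request
        self._max_units = burst * self._COST      # cap on saved-up retries
        self._units = 0

    def record_request(self) -> None:
        """Call once per original (non-retry) request."""
        self._units = min(self._max_units, self._units + self._deposit)

    def try_spend(self) -> bool:
        """Return True if a retry is allowed right now, and pay for it."""
        if self._units >= self._COST:
            self._units -= self._COST
            return True
        return False
```

A caller would check `budget.try_spend()` before each retry and sleep for `backoff_delay(attempt)` only when it returns `True`. During a downstream outage the bucket drains quickly, so retry traffic stays near the configured fraction of normal traffic instead of multiplying it.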

Hands-On Labs

  1. Lab 2.1 — gRPC vs REST Latency Bake-off

    Measure real latency and throughput of gRPC vs HTTP/JSON for the same logical workload.

    60 minutes - Beginner

    • Implement the same service interface as gRPC and HTTP/JSON
    • Generate identical client and server code
    • Run a 5-minute load test at 100 / 1000 / 5000 RPS
    • Capture p50/p95/p99 latency, throughput, CPU usage
    • Compare wire size for a representative request


  2. Lab 2.2 — Retry Storm Reproduction and Defence

    Cause and contain a retry storm in a controlled environment.

    90 minutes - Intermediate

    • Stand up a 3-service chain
    • Inject a 50% error rate at the bottom service
    • Configure callers with naive retries (no backoff, no budget)
    • Observe the QPS amplification
    • Add exponential backoff with jitter; observe
    • Add a retry budget; observe full recovery


  3. Lab 2.3 — DNS-Caused Outage Triage

    Reproduce a stale-DNS outage and walk through the triage flow.

    45 minutes - Intermediate

    • Deploy a service with DNS TTL 300s
    • Move the service to a new IP
    • Watch existing clients fail until cache expires
    • Reproduce with TTL 5s and observe smooth handoff
    • Document the runbook
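
Before running Lab 2.3 against real infrastructure, the stale window can be estimated with a small simulation. The sketch below assumes a client-side cache that trusts each answer until its TTL expires and a client that looks the name up once per second; `TtlDnsCache`, `stale_seconds`, the name `svc.internal`, and the IP addresses are all hypothetical.

```python
from typing import Callable, Dict, Tuple


class TtlDnsCache:
    """Client-side cache that trusts an answer until its TTL expires."""

    def __init__(self, resolve: Callable[[str], str], ttl_s: float,
                 clock: Callable[[], float]):
        self._resolve = resolve
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expiry)

    def lookup(self, name: str) -> str:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # still inside the TTL: may be stale
        ip = self._resolve(name)
        self._entries[name] = (ip, now + self._ttl_s)
        return ip


def stale_seconds(ttl_s: float, move_at_s: int, horizon_s: int = 600) -> int:
    """Count the seconds in which the client still sees the old IP after the move."""
    now = [0]
    record = {"svc.internal": "10.0.0.1"}  # authoritative answer (hypothetical)
    cache = TtlDnsCache(lambda n: record[n], ttl_s, lambda: now[0])
    stale = 0
    for t in range(horizon_s):
        now[0] = t
        if t == move_at_s:
            record["svc.internal"] = "10.0.0.2"  # the service moves
        if cache.lookup("svc.internal") != record["svc.internal"]:
            stale += 1
    return stale
```

Calling `stale_seconds(300, move_at_s=12)` and `stale_seconds(5, move_at_s=12)` contrasts the two TTLs used in the lab. The simulation covers only the resolver cache; long-lived connections in a pool keep using the old IP regardless of TTL, which is worth noting in the runbook.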

