Module 2: Networking & Distributed Communication
How services actually talk: TCP, HTTP/2, gRPC, service discovery, load balancing, retries, and the timeout discipline that keeps systems from melting.
4 hours. 3 hands-on labs. Free course module.
Learning Objectives
- Read a TCP/IP packet flow and explain what each layer does in production
- Compare HTTP/1.1, HTTP/2, and gRPC and pick the right one per workload
- Implement service discovery without inventing a worse DNS
- Design retry, timeout, and load-balancing policies that survive load
- Diagnose and prevent retry storms before they cause outages
Why This Matters
The patterns in this module are the difference between a service that survives a bad day and one that cascades into a multi-team incident. Engineers who internalise timeouts, retries with budgets, and connection pooling can read an incident timeline and immediately see where the design failed. Engineers who skip them tend to debug the same outage repeatedly.
Lesson Content
Two services talking is not one network call. It is, on a typical Kubernetes cluster, a DNS lookup, a service-mesh sidecar interception, an L4 load-balancer pick, a TLS handshake, an L7 retry policy, an actual HTTP/2 stream, deserialization on the receiver, and an audit log on the way back. Most distributed-systems incidents are not algorithm bugs; they are network bugs that look like algorithm bugs.
This module unpacks the stack so the next time your p99 latency doubles you know which layer to suspect.
TCP/IP — What Lives Underneath
The two-line summary every backend engineer needs: TCP gives you reliability (ordered delivery, retransmission, flow control) and connection state (the three-way handshake takes 1 RTT before the first byte of payload). IP gives you routing (each packet finds its way through a graph of routers without the endpoints knowing the path).
The TCP three-way handshake (SYN → SYN/ACK → ACK) is the per-connection latency floor. TLS adds another 1–2 RTTs for the handshake. So the first request on a fresh connection costs you 2–3 RTTs of pure overhead before anything useful happens. That is why connection pooling matters: amortise the handshake cost across many requests on the same connection.
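The amortisation argument can be put into numbers. The sketch below is only illustrative arithmetic: the function names are hypothetical, and it assumes 1 RTT for TCP plus a configurable number of TLS round trips, as described above.

```python
def handshake_overhead_rtts(tls_rtts: int = 1) -> int:
    """RTTs spent before the first request byte: 1 for TCP plus the TLS handshake."""
    return 1 + tls_rtts


def amortised_overhead_ms(rtt_ms: float, requests_per_connection: int, tls_rtts: int = 1) -> float:
    """Average handshake cost per request when one connection serves many requests."""
    if requests_per_connection < 1:
        raise ValueError("requests_per_connection must be >= 1")
    return handshake_overhead_rtts(tls_rtts) * rtt_ms / requests_per_connection
```

With an assumed 2 ms RTT and a 2-RTT TLS handshake, a fresh connection per request costs 6 ms of overhead each time, while a connection reused for 1000 requests brings the average down to 0.006 ms per request.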
HTTP/1.1 vs HTTP/2 vs gRPC — What Each Buys You
- HTTP/1.1: text-based request/response, head-of-line blocking on a single connection. Workaround: clients open many connections in parallel. Still the right answer for cacheable static content and many web APIs.
- HTTP/2: binary framing, multiplexed streams over a single connection (no head-of-line blocking at the HTTP level, although a lost TCP packet still stalls every stream on that connection), header compression (HPACK). One connection carries many concurrent requests. Required by gRPC.
- gRPC: an RPC framework on top of HTTP/2 with Protocol Buffers (Protobuf) serialization. Strongly-typed interfaces, code generation in many languages, streaming RPCs, deadlines built into the protocol. The de-facto choice for service-to-service communication in modern infrastructure.
The practical guidance: use gRPC for service-to-service calls in your own infra (typed contracts, low overhead, streaming when you need it). Use HTTP/JSON for external APIs (browser-callable, tool-friendly, debuggable with curl). Avoid HTTP/1.1 for internal traffic unless you have a specific reason.
Service Discovery — How Services Find Each Other
Static IP addresses do not work in a world where pods restart, scale up, or move between nodes every few minutes. Service discovery is the indirection: clients ask “where is the orders service?” and get back the current set of healthy endpoints.
The mainstream patterns:
- DNS-based: Kubernetes Services give every service a DNS name (orders.payments.svc.cluster.local, the orders Service in the payments namespace) that resolves to a stable ClusterIP, or directly to the current Pod IPs for a headless Service. Simple, integrates with everything, but DNS TTL caching can lag.
- Service registry: Consul, etcd, ZooKeeper. Services register on startup; clients query the registry. Fast updates; richer metadata (datacentre, health, weights).
- Service mesh: a sidecar proxy receives endpoint updates from a control plane (Envoy via the xDS protocol under Istio; Linkerd ships its own proxy and discovery API). Clients call orders as if it were local; the sidecar resolves the actual endpoints.
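To make the registry pattern concrete, here is a minimal in-memory sketch. It is not how Consul, etcd, or ZooKeeper are implemented; the class name, the heartbeat-with-TTL lease model, and the injectable clock are all assumptions chosen to keep the example small and testable.

```python
import time
from typing import Callable, Dict, List, Tuple


class ServiceRegistry:
    """Toy in-memory registry: instances heartbeat, stale ones drop out."""

    def __init__(self, ttl_s: float = 10.0, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        # (service, endpoint) -> time of the last heartbeat
        self._last_seen: Dict[Tuple[str, str], float] = {}

    def heartbeat(self, service: str, endpoint: str) -> None:
        """Register an instance or refresh its lease."""
        self._last_seen[(service, endpoint)] = self._clock()

    def deregister(self, service: str, endpoint: str) -> None:
        """Remove an instance immediately, e.g. on graceful shutdown."""
        self._last_seen.pop((service, endpoint), None)

    def healthy_endpoints(self, service: str) -> List[str]:
        """Return endpoints whose last heartbeat is within the TTL."""
        now = self._clock()
        return sorted(
            endpoint
            for (svc, endpoint), seen in self._last_seen.items()
            if svc == service and now - seen <= self._ttl_s
        )
```

A client would call healthy_endpoints("orders") before picking a backend. An instance that crashes without deregistering disappears from the answer once its lease expires, which is the "fast updates" property the registry pattern trades for extra infrastructure.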
Load Balancing
Load balancers turn a list of endpoints into a single virtual endpoint with traffic distribution. Two layers, distinct trade-offs:
- L4 load balancers (AWS NLB, kube-proxy iptables/IPVS) operate at the TCP layer. Cheap, fast, opaque to the application. Best for raw connection distribution; cannot do per-request routing.
- L7 load balancers (Envoy, NGINX, AWS ALB) understand HTTP. Can do header-based routing, path matching, retries, weighted shifting, mTLS termination, observability. Add latency (~1–5ms) but unlock the production-engineering toolbox.
Algorithm choice matters. Round robin is fine for uniform endpoints (and is Envoy's default policy); least-request handles variable backend speed; EWMA tracks a smoothed latency estimate and prefers fast endpoints; ring hash / consistent hash sticks the same key to the same backend (useful for cache locality).
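Two of these algorithms are short enough to sketch. The code below is an illustration under assumptions, not any proxy's actual implementation: the least-request picker uses the power-of-two-choices variant, the ring uses SHA-256 with 100 virtual nodes per backend, and all names are hypothetical.

```python
import bisect
import hashlib
import random
from typing import Dict, List, Optional


def pick_least_request(in_flight: Dict[str, int], rng: Optional[random.Random] = None) -> str:
    """Power-of-two-choices: sample two backends, take the less loaded one."""
    rng = rng or random.Random()
    backends = list(in_flight)
    if not backends:
        raise ValueError("no backends")
    if len(backends) == 1:
        return backends[0]
    a, b = rng.sample(backends, 2)
    return a if in_flight[a] <= in_flight[b] else b


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, backends: List[str], vnodes: int = 100):
        if not backends:
            raise ValueError("no backends")
        self._ring = sorted(
            (self._hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def pick(self, key: str) -> str:
        """Walk clockwise from the key's hash to the next virtual node."""
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that matters for cache locality is that removing one backend only remaps the keys that lived on it; keys owned by the surviving backends stay where they were.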
Retries, Timeouts, and the Storm
Two rules that, applied with discipline, prevent most outages:
- Every call has a timeout. Default is “wait forever” in most languages. Override it. The timeout should be shorter than your caller's timeout (so retries can fire within the deadline budget).
- Every retry has exponential backoff with jitter. Wait 1s, then 2s, then 4s, with random jitter so that clients do not retry in lockstep. The AWS Architecture Blog compares “full jitter” and “decorrelated jitter” variants; either is a sound default.
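Both rules can be sketched together. The two generators follow the full-jitter and decorrelated-jitter formulas from the AWS Architecture Blog's backoff write-up; the retry loop around them, its name, its defaults, and the convention that the callable receives its remaining time as a timeout are assumptions made for illustration.

```python
import random
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def full_jitter(base_s: float, cap_s: float, rng: random.Random) -> Iterator[float]:
    """Yield sleeps drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    attempt = 0
    while True:
        yield rng.uniform(0.0, min(cap_s, base_s * 2 ** min(attempt, 32)))
        attempt += 1


def decorrelated_jitter(base_s: float, cap_s: float, rng: random.Random) -> Iterator[float]:
    """Yield sleeps drawn from [base, 3 * previous sleep], clamped to cap."""
    sleep_s = base_s
    while True:
        sleep_s = min(cap_s, rng.uniform(base_s, sleep_s * 3))
        yield sleep_s


def call_with_retries(
    fn: Callable[[float], T],
    deadline_s: float,
    max_attempts: int = 4,
    backoff: Optional[Iterator[float]] = None,
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(timeout_s) until it succeeds, attempts run out, or the deadline passes."""
    backoff = backoff or full_jitter(0.1, 2.0, random.Random())
    give_up_at = now() + deadline_s
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        remaining = give_up_at - now()
        if remaining <= 0:
            break
        try:
            return fn(remaining)  # per-attempt timeout never exceeds what is left
        except Exception as exc:  # a real client would retry only retryable errors
            last_error = exc
        if attempt < max_attempts - 1:
            sleep(min(next(backoff), max(0.0, give_up_at - now())))
    raise TimeoutError("attempts or deadline exhausted") from last_error
```

Passing the remaining time into each attempt is what keeps the callee's timeout shorter than the caller's: no attempt can outlive the overall deadline, and no sleep can push the loop past it.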
The retry storm is the canonical anti-pattern: a backend brownout causes timeouts; clients retry; their retries push more load onto the backend; the backend cannot recover; clients keep retrying. The defence is the retry budget: cap retries at a percentage of normal traffic (e.g. retries cannot exceed 10% of in-flight requests). Envoy supports retry budgets directly, and gRPC clients offer token-based retry throttling that serves the same purpose.
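A retry budget is simple to sketch on the client side. This version is an assumption-laden illustration rather than Envoy's or gRPC's mechanism: it counts first attempts and retries in a sliding time window and grants a retry only while retries stay under the configured ratio. The class name, the window, and the minimum allowance are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class RetryBudget:
    """Allow retries only while they stay under `ratio` of recent first attempts."""

    def __init__(self, ratio: float = 0.1, window_s: float = 10.0,
                 min_retries: int = 1, clock: Callable[[], float] = time.monotonic):
        self._ratio = ratio
        self._window_s = window_s
        self._min_retries = min_retries  # keeps low-traffic clients able to retry at all
        self._clock = clock
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_retry)

    def _trim(self) -> None:
        """Drop events that have fallen out of the sliding window."""
        cutoff = self._clock() - self._window_s
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def record_request(self) -> None:
        """Call once per first attempt."""
        self._events.append((self._clock(), False))

    def try_acquire_retry(self) -> bool:
        """Return True and record the retry if the budget allows it."""
        self._trim()
        requests = sum(1 for _, is_retry in self._events if not is_retry)
        retries = len(self._events) - requests
        allowed = max(self._min_retries, int(requests * self._ratio))
        if retries >= allowed:
            return False
        self._events.append((self._clock(), True))
        return True
```

During a brownout every request fails and wants a retry, but the budget grants only about one in ten, so the extra load on the struggling backend is bounded at roughly 10% instead of 300%.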
The Rate Limiting Algorithms guide covers the related primitive of capping arrival rate; combined with retry budgets, it is the front-of-house resilience kit.
Connection Pooling
HTTP/2 lets one TCP connection carry many requests. For high-throughput service-to-service calls, every client should hold an open pool of connections to each upstream and reuse them. Pool sizing rules of thumb:
- Min pool: p99_concurrency × 1.2 (enough headroom that requests do not queue for a connection under normal load).
- Max pool: large enough to avoid queueing under burst, small enough to not exhaust the upstream's file descriptors.
- Idle timeout: 30–60s. Short enough to recover from broken connections; long enough to amortise handshake.
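The sizing rule and the idle timeout can be sketched as follows. This is a simplified, single-threaded illustration: the names are hypothetical, max_size here bounds only idle connections, and a production pool would also close evicted sockets and guard its state with a lock.

```python
import math
import time
from typing import Callable, Generic, List, Tuple, TypeVar

C = TypeVar("C")


def min_pool_size(p99_concurrency: float, headroom: float = 1.2) -> int:
    """Rule of thumb from the text: p99 concurrency plus 20% headroom."""
    return max(1, math.ceil(p99_concurrency * headroom))


class IdlePool(Generic[C]):
    """LIFO pool that discards connections idle longer than `idle_timeout_s`."""

    def __init__(self, connect: Callable[[], C], max_size: int,
                 idle_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._connect = connect
        self._max_size = max_size
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._idle: List[Tuple[float, C]] = []  # (released_at, connection)

    def acquire(self) -> C:
        """Reuse the most recently released live connection, else dial a new one."""
        now = self._clock()
        while self._idle:
            released_at, conn = self._idle.pop()
            if now - released_at <= self._idle_timeout_s:
                return conn
            # Idle too long: drop it (a real pool would also close the socket).
        return self._connect()

    def release(self, conn: C) -> None:
        """Return a connection; beyond max_size idle entries it is dropped."""
        if len(self._idle) < self._max_size:
            self._idle.append((self._clock(), conn))
```

Handing out the most recently used connection first keeps a small hot set busy and lets the rest age out, which is how the idle timeout gets a chance to shed connections after a burst.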
DNS as Distributed-Systems Risk
DNS is the cause of more “unexplained” outages than any other piece of distributed-systems infrastructure. Common failure modes:
- Resolver caches a stale entry; service is moved; clients call dead endpoints for the TTL window.
- CoreDNS / kube-dns hits a query-rate limit; lookups time out; entire mesh stalls.
- External resolver (8.8.8.8) is unreachable; in-cluster lookups slow because of fallback chains.
Mitigations: short TTLs for in-cluster names (5–30s), use NodeLocal DNSCache on Kubernetes, monitor DNS error rate as a first-class metric, prefer service-mesh discovery (sidecar handles endpoint changes via xDS, no DNS in the data path). For Kubernetes-specific networking patterns reach for the Kubernetes cheatsheet; for service-mesh pattern reference the Service Mesh cheatsheet; for API gateway / external API security the API Security cheatsheet.
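The stale-entry failure mode is easy to reproduce in miniature. The sketch below assumes a hypothetical TTL cache wrapped around an injected resolver function; it does not model any real resolver library, only the caching behaviour that produces the stale window.

```python
import time
from typing import Callable, Dict, List, Tuple


class TtlResolverCache:
    """Cache resolver answers for `ttl_s` seconds, as a caching resolver might."""

    def __init__(self, resolve: Callable[[str], List[str]], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._resolve = resolve
        self._ttl_s = ttl_s
        self._clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}  # name -> (expires_at, ips)
        self.misses = 0  # upstream lookups; a candidate for a first-class metric

    def lookup(self, name: str) -> List[str]:
        """Serve from cache until expiry, then ask the upstream resolver again."""
        now = self._clock()
        entry = self._cache.get(name)
        if entry and now < entry[0]:
            return entry[1]  # may be stale if the service moved after caching
        self.misses += 1
        ips = self._resolve(name)
        self._cache[name] = (now + self._ttl_s, ips)
        return ips
```

With a 300-second TTL, a service that moves one second after being cached is unreachable through this client for almost five minutes. Dropping the TTL to 5 seconds shrinks the window but multiplies the upstream query rate, which is the load a node-local cache is meant to absorb.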
[Diagrams: TLS Handshake Sequence; DNS Resolution Flow on Kubernetes; Load Balancer Architecture; Retry Storm Propagation]
Self-Check Quiz
- Why does HTTP/1.1 lead to many TCP connections per browser tab while HTTP/2 needs only one? (Answer: HTTP/1.1 has head-of-line blocking on a single connection; clients open multiple connections to parallelise. HTTP/2 multiplexes streams over one connection.)
- You set retries: 3 on every internal call. A downstream brownout begins. What happens? (Answer: a retry storm; total RPS to the failing service climbs to as much as 4x normal, preventing recovery. Need a retry budget capping retries at, say, 10% of RPS.)
- What is the right service-discovery pattern for a 50-service Kubernetes cluster? (Answer: Kubernetes Services for in-cluster DNS by default; service mesh sidecar via xDS for richer mesh-aware discovery if you already run a mesh.)
- Your p99 latency tripled overnight. The application code did not change. Where do you look first? (Answer: DNS error rate, recent CoreDNS changes, NodeLocal DNSCache health. DNS is the biggest source of unexplained latency anomalies in Kubernetes.)
- When should you choose gRPC over HTTP/JSON? (Answer: service-to-service inside your infra where you control both sides and want strongly-typed contracts; not for browser-callable APIs where HTTP/JSON wins on tooling.)
Real-World Use Cases
- Google built gRPC as the open-source successor to Stubby, the internal RPC framework that has carried its service-to-service traffic for years.
- Cloudflare reduced internal latency by ~30% by moving from HTTP/1.1 to HTTP/2 with connection pooling.
- Netflix's Hystrix circuit-breaker library (now in maintenance mode, with Resilience4j recommended in its place) was created to stop a single dependency outage from cascading across the entire viewing platform.
- AWS published the “decorrelated jitter” backoff algorithm after measuring how poorly synchronised retries handled their load.
Production Notes
- Set per-call timeouts at every layer. The default of “wait forever” in many standard libraries is a recurring source of production stalls.
- Run NodeLocal DNSCache on every Kubernetes cluster. The cost is one DaemonSet; the benefit is a node-local cache that cuts lookup latency and takes query load off CoreDNS.
- Treat retry policies as part of the service contract; document them and review at deployment.
Common Mistakes
- Setting infinite retries on a non-idempotent endpoint — one downstream blip becomes duplicate side effects everywhere.
- Not setting per-call timeouts; the system inherits the default of “wait forever”.
- Using HTTP/1.1 for high-throughput internal communication; you pay for handshakes you do not need.
Security Risks to Watch
- Plaintext HTTP between services exposes payloads to passive network attackers. Default to mTLS for any internal call carrying sensitive data.
- Trusting X-Forwarded-For from untrusted upstream is a rate-limit / IP-allowlist bypass vector.
- Service discovery without authentication lets a compromised pod register fake endpoints and intercept traffic.
- gRPC reflection enabled in production exposes internal API schema to anyone who can hit the endpoint.
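The X-Forwarded-For risk has a standard defence: trust the header only when the direct peer is one of your own proxies, and read it from the right. The sketch below assumes a hypothetical client_ip helper and a list of trusted proxy networks supplied by the caller; it is an illustration, not any framework's API.

```python
import ipaddress
from typing import Iterable, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def client_ip(peer_ip: str, x_forwarded_for: str, trusted_proxies: Iterable[Network]) -> str:
    """Return the right-most address that does not belong to a trusted proxy."""
    trusted = list(trusted_proxies)

    def is_trusted(ip: str) -> bool:
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            return False  # garbage in the header is never trusted
        return any(addr in net for net in trusted)

    if not is_trusted(peer_ip):
        return peer_ip  # direct connection from outside: ignore the header entirely
    hops = [hop.strip() for hop in x_forwarded_for.split(",") if hop.strip()]
    # Entries to the left of our own proxies are attacker-controlled; the first
    # untrusted entry from the right was written by a proxy we trust.
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    return peer_ip
```

Taking the left-most entry instead would let any caller choose the IP that rate limiters and allowlists see, simply by sending its own X-Forwarded-For header.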
Design Tradeoffs
gRPC for service-to-service
Pros
- Strong typing via Protobuf
- 2-5x lower wire size than JSON
- Streaming RPCs
- Built-in deadlines
Cons
- Harder to debug than HTTP/JSON
- Browser support requires gRPC-Web
- Smaller ecosystem than REST
HTTP/JSON for service-to-service
Pros
- Universal tooling (curl, Postman)
- Browser-callable
- Easy to log/debug
Cons
- Verbose wire format
- No built-in deadlines
- Weak typing
Service mesh (Envoy/Istio)
Pros
- mTLS, retries, and circuit breakers without application changes
- Centralised policy
- Rich observability
Cons
- Sidecar latency tax (~1-3ms/hop)
- Operational complexity
- Learning curve
Production Alternatives
- gRPC: Strongly-typed RPC over HTTP/2; standard for service-to-service in modern infra.
- Connect / Twirp: Lighter alternatives to gRPC with simpler tooling; trade ecosystem maturity for ergonomics.
- GraphQL Federation: For client-facing APIs that need composition across many backend services.
- NATS / message-based RPC: When you need request-response without a direct connection between every pair of services; useful for IoT/edge.
Think Like an Engineer
- Before debating gRPC vs REST, ask: who calls this API? If browsers, REST. If your own services, gRPC unless there is a reason against.
- Every retry policy is a load multiplier. Calculate the worst-case load on the downstream when every caller hits its retry cap simultaneously.
- DNS is in the data path whenever a client opens a new connection or refreshes its endpoint list. Treat its latency and error rate as first-class metrics, not infrastructure noise.
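The worst-case calculation for retries is a product over the call chain. The helper names below are hypothetical; the arithmetic assumes every hop exhausts its retries and that each retry at one hop triggers the full retry behaviour of the hops beneath it.

```python
from typing import Sequence


def worst_case_amplification(retries_per_hop: Sequence[int]) -> int:
    """Attempts reaching the last service per original request when every hop
    exhausts its retries: the product of (1 + retries) over the call chain."""
    factor = 1
    for retries in retries_per_hop:
        if retries < 0:
            raise ValueError("retries must be >= 0")
        factor *= 1 + retries
    return factor


def worst_case_rps(normal_rps: float, retries_per_hop: Sequence[int]) -> float:
    """Load on the bottom service if all callers hit their retry cap at once."""
    return normal_rps * worst_case_amplification(retries_per_hop)
```

A single hop with retries: 3 gives 4x. Three hops each with retries: 3 give 4 × 4 × 4 = 64x, which is why retrying at every layer of a deep call chain is far more dangerous than the per-hop setting suggests.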
Production Story
A consumer-facing API team enabled aggressive client-side retries (3 retries per call, no backoff) after seeing transient 503s in CI. Two weeks later their payments backend went into brownout. Within minutes the retry logic amplified normal traffic by up to 4x; the backend could not recover; the entire mobile app was down for 22 minutes. Post-mortem: add a retry budget (cap retries at 10% of RPS), exponential backoff with jitter, and a circuit breaker on the client side. The fix was 50 lines of code and a config change.
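The client-side circuit breaker from that post-mortem can be sketched in a few dozen lines. This is a minimal, single-threaded illustration with hypothetical names and thresholds, not the team's actual fix and not the Hystrix or Resilience4j API: it opens after a run of consecutive failures, fails fast while open, and lets one probe through after a reset interval.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the backend."""


class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures; probe after `reset_s`."""

    def __init__(self, failure_threshold: int = 5, reset_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_s = reset_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_s:
                raise CircuitOpenError("backend marked unhealthy; failing fast")
            # Reset interval elapsed: half-open, let this call through as a probe.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # (re)open and restart the timer
            raise
        self._failures = 0
        self._opened_at = None  # success closes the breaker
        return result
```

While the breaker is open the backend receives no traffic from this client at all, which gives a browned-out service the breathing room that retries alone take away.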
Key Terms
- gRPC
- Open-source RPC framework using HTTP/2 + Protocol Buffers; standard for service-to-service in modern infra.
- Connection pooling
- Reusing TCP/HTTP connections across requests to amortise handshake cost.
- Retry storm
- Failure mode where retries amplify load on an already-struggling backend, preventing recovery.
- Retry budget
- Cap on total retries as a percentage of RPS; prevents retry storms.
- Service discovery
- Mechanism by which clients find healthy endpoints for a service (DNS, registry, mesh).
Hands-On Labs
Lab 2.1 — gRPC vs REST Latency Bake-off
Measure real latency and throughput of gRPC vs HTTP/JSON for the same logical workload.
60 minutes - Beginner
- Implement the same service interface as gRPC and HTTP/JSON
- Generate identical client and server code
- Run a 5-minute load test at 100 / 1000 / 5000 RPS
- Capture p50/p95/p99 latency, throughput, CPU usage
- Compare wire size for a representative request
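For the latency capture step, the percentiles can be computed from raw samples with the standard library alone. The helper name is hypothetical and the choice of the "inclusive" interpolation method is an assumption; load-testing tools may use a different percentile definition, so compare like with like.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """p50/p95/p99 from raw latency samples (needs at least two samples)."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Run it once per protocol and per load level so the gRPC and HTTP/JSON numbers are computed the same way from the same kind of samples.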
Lab 2.2 — Retry Storm Reproduction and Defence
Cause and contain a retry storm in a controlled environment.
90 minutes - Intermediate
- Stand up a 3-service chain
- Inject a 50% error rate at the bottom service
- Configure callers with naive retries (no backoff, no budget)
- Observe the QPS amplification
- Add exponential backoff with jitter; observe
- Add a retry budget; observe full recovery
Lab 2.3 — DNS-Caused Outage Triage
Reproduce a stale-DNS outage and walk through the triage flow.
45 minutes - Intermediate
- Deploy a service with DNS TTL 300s
- Move the service to a new IP
- Watch existing clients fail until cache expires
- Reproduce with TTL 5s and observe smooth handoff
- Document the runbook
Key Takeaways
- Most distributed-systems incidents are network incidents that look like application bugs
- Use gRPC for service-to-service, HTTP/JSON for external; avoid HTTP/1.1 internally
- Every call has a timeout. Every retry has exponential backoff with jitter. Every retry policy has a budget
- Service discovery is mandatory infrastructure — pick DNS, registry, or mesh deliberately
- DNS is the cause of more unexplained outages than any other layer