Module 2 of 12

Networking & Distributed Communication

How services actually talk: TCP, HTTP/2, gRPC, service discovery, load balancing, retries, and the timeout discipline that keeps systems from melting.

4 hours · 3 labs · Free

Start here

Learning objectives

Read a TCP/IP packet flow and explain what each layer does in production
Compare HTTP/1.1, HTTP/2, and gRPC and pick the right one per workload
Implement service discovery without inventing a worse DNS
Design retry, timeout, and load-balancing policies that survive load
Diagnose and prevent retry storms before they cause outages

Before

No timeouts on outbound calls; one slow downstream stalls the entire service
Naive retry policies that amplify failures during brownouts
DNS caching tuned by accident; outages from stale entries
Short-lived HTTP/1.1 connections; handshake overhead on every cold call

After

Per-call timeouts at every layer, each shorter than the caller's deadline
Retries with exponential backoff + jitter + budget; no amplification
NodeLocal DNSCache + short TTLs + DNS error rate alerted
gRPC over HTTP/2 with connection pooling; handshake cost amortised

Two services talking is not one network call. It is, on a typical Kubernetes cluster, a DNS lookup, a TCP and TLS handshake, a service-mesh sidecar interception, an L4 load-balancer pick, an L7 retry policy, an actual HTTP/2 stream, deserialization on the receiver, and an audit log on the way back. Most distributed-systems incidents are not algorithm bugs - they are network bugs that look like algorithm bugs.

This module unpacks the stack so the next time your p99 latency doubles you know which layer to suspect.

TCP/IP - What Lives Underneath

The two-line summary every backend engineer needs: TCP gives you reliability (ordered delivery, retransmission, flow control) and connection state (the three-way handshake takes 1 RTT before the first byte of payload). IP gives you routing (each packet finds its way through a graph of routers without the endpoints knowing the path).

The TCP three-way handshake (SYN → SYN/ACK → ACK) is the per-connection latency floor. A full TLS handshake adds another 1–2 RTTs (one for TLS 1.3, two for TLS 1.2). So the first request on a fresh connection costs you 2–3 RTTs of pure overhead before anything useful happens. That is why connection pooling matters: amortise the handshake cost across many requests on the same connection.
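To make the amortisation argument concrete, here is a small worked calculation. It is a sketch under simplifying assumptions: requests on a connection are sequential, each request costs exactly one RTT, and session resumption is ignored. The function names and the 2 ms RTT are illustrative, not taken from any library.

```python
def connection_overhead_ms(rtt_ms: float, tls_rtts: int = 1) -> float:
    """Handshake time before the first request byte: 1 RTT for TCP plus tls_rtts for TLS."""
    return rtt_ms * (1 + tls_rtts)


def mean_latency_ms(rtt_ms: float, requests_per_connection: int, tls_rtts: int = 1) -> float:
    """Mean per-request latency when one connection serves several sequential requests.

    Each request costs one RTT for the request/response exchange; the handshake
    is paid once and spread across every request on the connection.
    """
    overhead = connection_overhead_ms(rtt_ms, tls_rtts)
    return rtt_ms + overhead / requests_per_connection


if __name__ == "__main__":
    rtt = 2.0  # assumed in-cluster round-trip time in milliseconds
    for n in (1, 10, 1000):
        print(n, round(mean_latency_ms(rtt, n), 3))
```

With one request per connection the handshake triples the latency of a 1-RTT call; spread over a thousand requests it all but disappears.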

HTTP/1.1 vs HTTP/2 vs gRPC - What Each Buys You

HTTP/1.1: text-based request/response, head-of-line blocking on a single connection. Workaround: clients open many connections in parallel. Still the right answer for cacheable static content and many web APIs.
HTTP/2: binary framing, multiplexed streams over a single connection (no head-of-line blocking at the HTTP level, although a lost TCP packet still stalls every stream on that connection), header compression (HPACK). One connection carries many concurrent requests. Required by gRPC.
gRPC: an RPC framework on top of HTTP/2 with Protocol Buffers (Protobuf) serialization. Strongly-typed interfaces, code generation in many languages, streaming RPCs, deadlines built into the protocol. The de-facto choice for service-to-service communication in modern infrastructure.

The practical guidance: use gRPC for service-to-service calls in your own infra (typed contracts, low overhead, streaming when you need it). Use HTTP/JSON for external APIs (browser-callable, tool-friendly, debuggable with curl). Avoid HTTP/1.1 for internal traffic unless you have a specific reason.

Service Discovery - How Services Find Each Other

Static IP addresses do not work in a world where pods restart, scale up, or move between nodes every few minutes. Service discovery is the indirection: clients ask “where is the orders service?” and get back the current set of healthy endpoints.

The mainstream patterns:

DNS-based - Kubernetes Services give every service a DNS name (orders.payments.svc.cluster.local) that resolves to the Service's stable ClusterIP, or directly to the current Pod IPs for a headless Service. Simple, integrates with everything, but DNS TTL caching can lag.
Service registry - Consul, etcd, ZooKeeper. Services register on startup; clients query the registry. Fast updates; richer metadata (datacentre, health, weights).
Service mesh - the sidecar (Envoy) handles discovery via xDS protocol from a control plane (Istio, Linkerd). Clients call orders as if it were local; the sidecar resolves the actual endpoints.
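The registry pattern can be sketched in a few lines. This is a toy in-memory model, not how Consul, etcd, or ZooKeeper are implemented: the Registry class, its method names, and the heartbeat TTL are all hypothetical, chosen only to show the register / expire / lookup cycle.

```python
import time
from typing import Callable, Dict, List


class Registry:
    """Toy in-memory service registry with heartbeat expiry (illustration only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # service name -> {endpoint: time of last heartbeat}
        self._last_seen: Dict[str, Dict[str, float]] = {}

    def heartbeat(self, service: str, endpoint: str) -> None:
        """Register an endpoint, or refresh it if it is already known."""
        self._last_seen.setdefault(service, {})[endpoint] = self._clock()

    def deregister(self, service: str, endpoint: str) -> None:
        """Remove an endpoint immediately, e.g. on graceful shutdown."""
        self._last_seen.get(service, {}).pop(endpoint, None)

    def lookup(self, service: str) -> List[str]:
        """Return endpoints whose last heartbeat is within the TTL."""
        now = self._clock()
        entries = self._last_seen.get(service, {})
        return sorted(ep for ep, seen in entries.items() if now - seen <= self._ttl)


if __name__ == "__main__":
    registry = Registry(ttl_seconds=10.0)
    registry.heartbeat("orders", "10.0.0.1:8080")
    registry.heartbeat("orders", "10.0.0.2:8080")
    print(registry.lookup("orders"))
```

An endpoint that stops sending heartbeats drops out of lookup results once the TTL passes, which is the "healthy endpoints only" property the patterns above are all trying to provide.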

Load Balancing

Load balancers turn a list of endpoints into a single virtual endpoint with traffic distribution. Two layers, distinct trade-offs:

L4 load balancers (AWS NLB, kube-proxy iptables/IPVS) operate at the TCP layer. Cheap, fast, opaque to the application. Best for raw connection distribution; cannot do per-request routing.
L7 load balancers (Envoy, NGINX, AWS ALB) understand HTTP. Can do header-based routing, path matching, retries, weighted shifting, mTLS termination, observability. Add latency (~1–5ms) but unlock the production-engineering toolbox.

Algorithm choice matters. Round robin is fine for uniform endpoints (and is Envoy's default policy); least-request handles variable backend speed; EWMA tracks a smoothed latency estimate and prefers fast endpoints; ring hash / consistent hash sticks the same key to the same backend (useful for cache locality).
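The following sketch shows three of these algorithms side by side. It is a simplified illustration, not the implementation used by Envoy or any other proxy: the class names are hypothetical, least-request is approximated with the "power of two choices" sampling trick, and the ring hash uses SHA-256 with 100 virtual nodes per endpoint purely as an assumption.

```python
import bisect
import hashlib
import itertools
import random
from typing import Dict, List, Optional


class RoundRobin:
    """Cycle through endpoints in order."""

    def __init__(self, endpoints: List[str]):
        self._cycle = itertools.cycle(endpoints)

    def pick(self) -> str:
        return next(self._cycle)


class LeastRequest:
    """Power of two choices: sample two endpoints, take the one with fewer in-flight requests.

    Needs at least two endpoints.
    """

    def __init__(self, endpoints: List[str], rng: Optional[random.Random] = None):
        self.in_flight: Dict[str, int] = {ep: 0 for ep in endpoints}
        self._rng = rng or random.Random()

    def pick(self) -> str:
        a, b = self._rng.sample(list(self.in_flight), 2)
        chosen = a if self.in_flight[a] <= self.in_flight[b] else b
        self.in_flight[chosen] += 1  # caller must call done() when the request ends
        return chosen

    def done(self, endpoint: str) -> None:
        self.in_flight[endpoint] -= 1


class RingHash:
    """Consistent hashing with virtual nodes: the same key maps to the same endpoint."""

    def __init__(self, endpoints: List[str], vnodes: int = 100):
        self._ring = sorted(
            (self._hash(f"{ep}#{i}"), ep) for ep in endpoints for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def pick(self, key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that makes ring hash useful for cache locality is that removing one endpoint only remaps the keys that endpoint owned; every other key keeps its backend.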

Retries, Timeouts, and the Storm

Two rules that, applied with discipline, prevent most outages:

Every call has a timeout. Many standard-library HTTP and socket clients default to “wait forever”. Override it. The timeout should be shorter than your caller's timeout (so retries can fire within the deadline budget).
Every retry has exponential backoff with jitter. Wait 1s, then 2s, then 4s - with random jitter to avoid synchronising retries. The “full jitter” and “decorrelated jitter” variants described by AWS are the usual reference points.
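The two rules combine naturally into one helper. The sketch below uses the decorrelated-jitter recurrence (each sleep is drawn uniformly between the base delay and three times the previous sleep, then capped) and passes the remaining deadline to each attempt as its timeout. The function names, the 0.1 s base, and the 2 s cap are assumptions for illustration; real code should also retry only on errors known to be retryable rather than on every exception.

```python
import random
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def decorrelated_jitter(base: float, cap: float, rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield sleep durations: each is uniform in [base, 3 * previous], capped at cap."""
    rng = rng or random.Random()
    sleep = base
    while True:
        sleep = min(cap, rng.uniform(base, sleep * 3))
        yield sleep


def call_with_retries(
    operation: Callable[[float], T],
    deadline_s: float,
    max_attempts: int = 3,
    base: float = 0.1,
    cap: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call operation(timeout) until it succeeds, attempts run out, or the deadline passes.

    operation receives the remaining time budget and is expected to use it as
    its own timeout, so no single attempt can outlive the caller's deadline.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    give_up_at = clock() + deadline_s
    delays = decorrelated_jitter(base, cap)
    for attempt in range(1, max_attempts + 1):
        remaining = give_up_at - clock()
        if remaining <= 0:
            raise TimeoutError("deadline exceeded before attempt")
        try:
            return operation(remaining)
        except Exception:  # simplification: treats every error as retryable
            if attempt == max_attempts:
                raise
            delay = next(delays)
            if clock() + delay >= give_up_at:
                raise  # not enough budget left to wait and retry
            sleep(delay)
    raise AssertionError("unreachable")
```

Passing the remaining budget down, rather than a fixed per-attempt timeout, is one way to keep the callee's timeout shorter than the caller's without hand-tuning each layer.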

The retry storm is the canonical anti-pattern: a backend brownout causes timeouts; clients retry; their retries push more load onto the backend; the backend cannot recover; clients keep retrying. The defence is the retry budget: cap retries at a percentage of normal traffic (e.g. retries cannot exceed 10% of requests). Envoy (retry budgets) and gRPC client libraries (retry throttling) both provide a form of this.
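A retry budget is simple enough to sketch directly. The version below is a deliberately minimal counter-based model, not the mechanism Envoy or gRPC use internally: RetryBudget, simulate_brownout, and the 10% figure are illustrative, and the counters cover the object's whole lifetime where a production version would use a sliding window.

```python
from typing import Optional


class RetryBudget:
    """Allow retries only while they stay under a fixed percentage of first attempts."""

    def __init__(self, retry_percent: int = 10, min_retries: int = 0):
        self.retry_percent = retry_percent
        self.min_retries = min_retries  # optional floor so low-traffic clients can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        allowed = max(self.min_retries, self.requests * self.retry_percent // 100)
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def simulate_brownout(first_attempts: int, max_retries: int, budget: Optional[RetryBudget] = None) -> int:
    """Total attempts sent downstream when every single attempt fails."""
    total = 0
    for _ in range(first_attempts):
        if budget is not None:
            budget.record_request()
        total += 1  # the first attempt always goes out
        for _ in range(max_retries):
            if budget is not None and not budget.try_acquire_retry():
                break  # budget exhausted: fail fast instead of adding load
            total += 1
    return total


if __name__ == "__main__":
    print("naive:   ", simulate_brownout(1000, max_retries=3))
    print("budgeted:", simulate_brownout(1000, max_retries=3, budget=RetryBudget(10)))
```

In this model a full brownout with three naive retries sends four attempts per request, while the 10% budget holds the total to 1.1 attempts per request.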

The Rate Limiting Algorithms guide covers the related primitive of capping arrival rate; combined with retry budgets, it is the front-of-house resilience kit.

Connection Pooling

HTTP/2 lets one TCP connection carry many requests. For high-throughput service-to-service calls, every client should hold an open pool of connections to each upstream and reuse them. Pool sizing rules of thumb:

Min pool: p99_concurrency × 1.2 (20% headroom so requests at typical peak concurrency do not queue for a connection).
Max pool: large enough to avoid queueing under burst, small enough to not exhaust the upstream's file descriptors.
Idle timeout: 30–60s. Short enough to recover from broken connections; long enough to amortise handshake.
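As a worked example of the sizing rules, the sketch below turns them into arithmetic. The minimum follows the p99_concurrency × 1.2 rule directly. The maximum is an assumption of this example, not a rule from the list: it divides a share of the upstream's file-descriptor limit evenly across client replicas, and the 50% reserve and all function names are hypothetical.

```python
import math


def min_pool_size(p99_concurrency: float, headroom: float = 1.2) -> int:
    """Minimum pool: p99 concurrency plus headroom, rounded up, never below 1."""
    return max(1, math.ceil(p99_concurrency * headroom))


def max_pool_size(upstream_fd_limit: int, client_replicas: int, reserve_percent: int = 50) -> int:
    """Largest per-client pool that keeps all clients within a share of the upstream's descriptors.

    reserve_percent is held back for the upstream's own files, logs, and
    outbound connections (an assumed figure; measure your own service).
    """
    usable = upstream_fd_limit * (100 - reserve_percent) // 100
    return max(1, usable // client_replicas)


if __name__ == "__main__":
    print(min_pool_size(40))         # 40 concurrent requests at p99
    print(max_pool_size(65536, 20))  # 20 client replicas sharing one upstream
```

If the computed minimum ever exceeds the computed maximum, the upstream needs more replicas or a higher descriptor limit; no pool setting will fix that.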

DNS as Distributed-Systems Risk

DNS is the cause of more “unexplained” outages than any other piece of distributed-systems infrastructure. Common failure modes:

Resolver caches a stale entry; service is moved; clients call dead endpoints for the TTL window.
CoreDNS / kube-dns hits a query-rate limit; lookups time out; entire mesh stalls.
External resolver (8.8.8.8) is unreachable; in-cluster lookups slow because of fallback chains.

Mitigations: short TTLs for in-cluster names (5–30s), use NodeLocal DNSCache on Kubernetes, monitor DNS error rate as a first-class metric, prefer service-mesh discovery (sidecar handles endpoint changes via xDS, no DNS in the data path). For Kubernetes-specific networking patterns, reach for the Kubernetes cheatsheet; for service-mesh patterns, the Service Mesh cheatsheet; for API gateway and external API security, the API Security cheatsheet.
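The stale-entry failure mode is easy to reproduce in miniature. The sketch below puts a minimal TTL cache in front of a resolver function and uses an injectable clock so the stale window can be observed without waiting. TtlCache and its parameters are hypothetical and stand in for any caching resolver; it does not model negative caching or real DNS records.

```python
import time
from typing import Callable, Dict, List, Tuple


class TtlCache:
    """Minimal TTL cache in front of a resolver function (illustration only)."""

    def __init__(
        self,
        resolve: Callable[[str], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        # name -> (expiry time, cached addresses)
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def lookup(self, name: str) -> List[str]:
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and now < hit[0]:
            return hit[1]  # may be stale if the service moved since it was cached
        addresses = self._resolve(name)
        self._entries[name] = (now + self._ttl, addresses)
        return addresses


if __name__ == "__main__":
    records = {"orders": ["10.0.0.1"]}
    now = [0.0]
    cache = TtlCache(lambda name: list(records[name]), ttl_seconds=300, clock=lambda: now[0])
    print(cache.lookup("orders"))
    records["orders"] = ["10.0.0.2"]  # the service moves
    now[0] = 299.0
    print(cache.lookup("orders"))     # still the old address
    now[0] = 300.0
    print(cache.lookup("orders"))     # refreshed after the TTL
```

The length of the outage is bounded by the TTL, which is the argument for short in-cluster TTLs: at 5 seconds the same move costs clients at most a few failed calls.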

TLS Handshake Sequence

DNS Resolution Flow on Kubernetes

Load Balancer Architecture

Retry Storm Propagation

Self-Check Quiz

Why does HTTP/1.1 lead to many TCP connections per browser tab while HTTP/2 needs only one? (Answer: HTTP/1.1 has head-of-line blocking on a single connection; clients open multiple connections to parallelise. HTTP/2 multiplexes streams over one connection.)
You set retries: 3 on every internal call. A downstream brownout begins. What happens? (Answer: a retry storm; total RPS to the failing service rises to as much as 4x normal, and more if several layers retry, preventing recovery. Need a retry budget capping retries at, say, 10% of RPS.)
What is the right service-discovery pattern for a 50-service Kubernetes cluster? (Answer: Kubernetes Services for in-cluster DNS by default; service mesh sidecar via xDS for richer mesh-aware discovery if you already run a mesh.)
Your p99 latency tripled overnight. The application code did not change. Where do you look first? (Answer: DNS error rate, recent CoreDNS changes, NodeLocal DNSCache health. DNS is the biggest source of unexplained latency anomalies in Kubernetes.)
When should you choose gRPC over HTTP/JSON? (Answer: service-to-service inside your infra where you control both sides and want strongly-typed contracts; not for browser-callable APIs where HTTP/JSON wins on tooling.)

Real world

Where this shows up

Google built gRPC as the open-source successor to Stubby, the internal RPC framework that carries its service-to-service traffic.
Cloudflare reduced internal latency by ~30% by moving from HTTP/1.1 to HTTP/2 with connection pooling.
Netflix's Hystrix circuit-breaker library (now retired in favour of Resilience4j) was created after a single dependency outage cascaded the entire viewing platform.
AWS published the “decorrelated jitter” backoff algorithm after measuring how poorly synchronised retries handled their load.

Production notes

Keep these close

Set per-call timeouts at every layer. The “wait forever” default in many standard libraries is one of the most common sources of production stalls.
Run NodeLocal DNSCache on every Kubernetes cluster. The cost is one DaemonSet; the benefit is that lookups are answered by a cache on the same node, which cuts latency, load on CoreDNS, and conntrack-related failures.
Treat retry policies as part of the service contract; document them and review at deployment.

Common mistakes

What usually breaks

Setting infinite retries on a non-idempotent endpoint - one downstream blip becomes duplicate side effects everywhere.
Not setting per-call timeouts; the system inherits the default of “wait forever”.
Using HTTP/1.1 for high-throughput internal communication; you pay for handshakes you do not need.

Security risks

Threats to watch

Plaintext HTTP between services exposes payloads to passive network attackers. Default to mTLS for any internal call carrying sensitive data.
Trusting X-Forwarded-For from untrusted upstream is a rate-limit / IP-allowlist bypass vector.
Service discovery without authentication lets a compromised pod register fake endpoints and intercept traffic.
gRPC reflection enabled in production exposes internal API schema to anyone who can hit the endpoint.
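The X-Forwarded-For risk above comes from reading the left-most entry, which the client controls. One common defence, sketched here under the assumption that you know the address ranges of your own proxies, is to walk the chain from the right and stop at the first hop you do not operate. The function name and the example addresses are illustrative.

```python
import ipaddress
from typing import Iterable


def client_ip(remote_addr: str, x_forwarded_for: str, trusted_proxies: Iterable[str]) -> str:
    """Return the right-most address in the chain that is not one of our own proxies."""
    trusted = [ipaddress.ip_network(net) for net in trusted_proxies]

    def is_trusted(addr: str) -> bool:
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            return False  # unparseable entries are never trusted
        return any(ip in net for net in trusted)

    # Hops in the order they were appended, with the direct TCP peer last.
    hops = [h.strip() for h in x_forwarded_for.split(",") if h.strip()]
    chain = hops + [remote_addr]
    for hop in reversed(chain):
        if not is_trusted(hop):
            return hop
    return chain[0]  # every hop was one of ours


if __name__ == "__main__":
    # The first entry is attacker-supplied; the second was appended by our proxy.
    print(client_ip("10.0.0.5", "6.6.6.6, 203.0.113.9", ["10.0.0.0/8"]))
```

Rate limits and allowlists keyed on this value cannot be bypassed by prepending a fake address, because only entries appended by trusted proxies are believed.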

Tradeoffs

Design choices you should be able to defend

gRPC for service-to-service

Pros

Strong typing via Protobuf
2-5x lower wire size than JSON
Streaming RPCs
Built-in deadlines

Cons

Harder to debug than HTTP/JSON
Browser support requires gRPC-Web
Smaller ecosystem than REST

HTTP/JSON for service-to-service

Pros

Universal tooling (curl, Postman)
Browser-callable
Easy to log/debug

Cons

Verbose wire format
No built-in deadlines
Weak typing

Service mesh (Envoy/Istio)

Pros

mTLS / retries / circuit breakers free
Centralised policy
Rich observability

Cons

Sidecar latency tax (~1-3ms/hop)
Operational complexity
Learning curve

Alternatives

Other production approaches

gRPC

Strongly-typed RPC over HTTP/2; standard for service-to-service in modern infra.

Connect / Twirp

Lighter alternatives to gRPC with simpler tooling; trade ecosystem maturity for ergonomics.

GraphQL Federation

For client-facing APIs that need composition across many backend services.

NATS / message-based RPC

When you need request-response over a message broker, without each service managing direct connections to every peer; useful for IoT/edge.

Think like an engineer

Questions to answer before shipping

Before debating gRPC vs REST, ask: who calls this API? If browsers, REST. If your own services, gRPC unless there is a reason against.
Every retry policy is a load multiplier. Calculate the worst-case load on the downstream when every caller hits its retry cap simultaneously.
DNS is in the path of every new connection. Treat its latency and error rate as first-class metrics, not infrastructure noise.
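The worst-case retry load asked for above has a simple closed form when every layer in a call chain retries independently and every attempt at the bottom fails: each hop multiplies the attempts by one plus its retry count. The function names below are illustrative, and the model assumes no backoff, budget, or deadline cuts the retries short.

```python
def worst_case_attempts(retries_per_hop: int, hops: int) -> int:
    """Attempts reaching the deepest service per user request when every hop exhausts its retries."""
    return (1 + retries_per_hop) ** hops


def worst_case_rps(normal_rps: int, retries_per_hop: int, hops: int) -> int:
    """Request rate at the deepest service under the same worst-case assumption."""
    return normal_rps * worst_case_attempts(retries_per_hop, hops)


if __name__ == "__main__":
    for hops in (1, 2, 3):
        print(hops, worst_case_attempts(3, hops), worst_case_rps(1000, 3, hops))
```

One retrying hop with retries: 3 gives the 4x figure; three such hops in a chain give 64x, which is why the multiplier has to be calculated for the whole call path and not per service.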

Key terms

Vocabulary used in this module

gRPC

Open-source RPC framework using HTTP/2 + Protocol Buffers; standard for service-to-service in modern infra.

Connection pooling

Reusing TCP/HTTP connections across requests to amortise handshake cost.

Retry storm

Failure mode where retries amplify load on an already-struggling backend, preventing recovery.

Retry budget

Cap on total retries as a percentage of RPS; prevents retry storms.

Service discovery

Mechanism by which clients find healthy endpoints for a service (DNS, registry, mesh).

Labs

Hands-on labs

60 minutes · Beginner

Lab 2.1 - gRPC vs REST Latency Bake-off

Measure real latency and throughput of gRPC vs HTTP/JSON for the same logical workload.

Implement the same service interface as gRPC and HTTP/JSON
Generate client and server code for both variants from the same logical interface
Run a 5-minute load test at 100 / 1000 / 5000 RPS
Capture p50/p95/p99 latency, throughput, CPU usage
Compare wire size for a representative request

View lab on GitHub

90 minutes · Intermediate

Lab 2.2 - Retry Storm Reproduction and Defence

Cause and contain a retry storm in a controlled environment.

Stand up a 3-service chain
Inject a 50% error rate at the bottom service
Configure callers with naive retries (no backoff, no budget)
Observe the QPS amplification
Add exponential backoff with jitter; observe
Add a retry budget; observe full recovery

View lab on GitHub

45 minutes · Intermediate

Lab 2.3 - DNS-Caused Outage Triage

Reproduce a stale-DNS outage and walk through the triage flow.

Deploy a service with DNS TTL 300s
Move the service to a new IP
Watch existing clients fail until cache expires
Reproduce with TTL 5s and observe smooth handoff
Document the runbook

View lab on GitHub

Recap

Key takeaways

Most distributed-systems incidents are network incidents that look like application bugs
Use gRPC for service-to-service, HTTP/JSON for external; avoid HTTP/1.1 internally
Every call has a timeout. Every retry has exponential backoff with jitter. Every retry policy has a budget
Service discovery is mandatory infrastructure - pick DNS, registry, or mesh deliberately
DNS is the cause of more unexplained outages than any other layer
