How Kubernetes changes distributed-systems design — cluster architecture, service mesh, ingress, autoscaling, and the operational primitives that everything else now sits on top of.
-Use Services, Ingress, and Gateway API correctly for distributed workloads
-Compare service meshes (Istio, Linkerd, Cilium) and pick one with eyes open
-Run StatefulSets, PVCs, and storage classes for stateful workloads
-Operate workloads with HPA, VPA, Karpenter, and PodDisruptionBudgets in production
Before
-Pets-not-cattle; long-lived nodes that can't be replaced
-Shell scripts deploying directly to VMs; no declarative state
-No autoscaling; capacity provisioned for peak, idle most of the time
-Stateful workloads as single VMs; failure = data loss
After
+Cattle-not-pets; nodes are interchangeable and routinely cycled
+GitOps with Argo CD or Flux; declarative state, audited changes
+HPA + Karpenter; capacity scales with demand within minutes
+StatefulSets + PVCs + tested backups; node failure = pod reschedule, not data loss
Kubernetes is the substrate that runs most modern distributed systems. It is, itself, a distributed system — with consensus (etcd / Raft), partitioning (resources scheduled across nodes), replication (pods), and observability built in. Understanding how Kubernetes is constructed is now part of distributed-systems literacy.
Cluster Architecture
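A cluster is made up of control-plane components (kube-apiserver, etcd, kube-scheduler, kube-controller-manager) and per-node components (kubelet, kube-proxy, CNI plugin, container runtime):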
kube-apiserver: the only component that writes to etcd; every other component talks to apiserver. Stateless; horizontally scalable.
etcd: the source of truth for cluster state; Raft-replicated; 3 or 5 nodes.
kube-scheduler: assigns pending pods to nodes based on resource fit, affinity, and topology.
kube-controller-manager: runs reconciliation loops (Deployment, ReplicaSet, Node, Endpoints, etc.). When several replicas run, one is elected leader through a Lease object in the API server (persisted in etcd); the others stand by.
kubelet: the node agent; pulls container images, runs containers via the container runtime, reports node and pod status to apiserver.
kube-proxy: implements the Service abstraction by programming iptables / IPVS rules on each node. (Modern alternative: Cilium's eBPF datapath with no kube-proxy.)
CNI plugin: pod networking (Cilium, Calico, AWS VPC CNI, etc.). Provides pod IPs, NetworkPolicy enforcement, often eBPF observability.
Container runtime: containerd or CRI-O; runs the containers.
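The "3 or 5 nodes" guidance for etcd follows from Raft's majority quorum: writes need a majority of members, so an even-sized cluster buys no extra fault tolerance. A minimal sketch of the arithmetic (the function names are illustrative, not part of any etcd API):

```python
def quorum(members: int) -> int:
    """Smallest majority of a Raft cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} members: quorum={quorum(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Three members tolerate one failure and five tolerate two; four members still tolerate only one, which is why odd sizes are the norm.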
Service, Ingress, Gateway API
Service: stable virtual IP for a set of pods; load-balances traffic across them. ClusterIP for in-cluster traffic, NodePort to expose a static port on every node, LoadBalancer to provision a cloud load balancer.
Ingress: L7 routing for HTTP/HTTPS; needs an Ingress Controller (nginx-ingress, AWS ALB, Envoy-based, etc.). Older API; many extensions baked into annotations.
Gateway API: the successor to Ingress (the Ingress API is frozen, not removed); richer, role-separated (GatewayClass / Gateway / HTTPRoute, etc.), portable across implementations. The right choice for new infra.
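To make the role separation concrete, here is a hedged sketch that builds a Gateway (typically owned by the platform team) and an HTTPRoute (owned by an application team) as plain Python dictionaries and prints them as JSON manifests. The resource names, namespaces, hostname, and the gateway class "example-gateway-class" are all hypothetical; the field names follow the gateway.networking.k8s.io/v1 API.

```python
import json


def gateway(name, namespace, gateway_class, port=80):
    """Gateway: defines where traffic enters the cluster."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "Gateway",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "gatewayClassName": gateway_class,
            "listeners": [{
                "name": "http",
                "protocol": "HTTP",
                "port": port,
                # Let routes from other namespaces attach to this listener.
                "allowedRoutes": {"namespaces": {"from": "All"}},
            }],
        },
    }


def http_route(name, namespace, gw, hostname, path_prefix, service, service_port):
    """HTTPRoute: attaches to a Gateway and forwards matches to a Service."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": [{
                "name": gw["metadata"]["name"],
                "namespace": gw["metadata"]["namespace"],
            }],
            "hostnames": [hostname],
            "rules": [{
                "matches": [{"path": {"type": "PathPrefix", "value": path_prefix}}],
                "backendRefs": [{"name": service, "port": service_port}],
            }],
        },
    }


if __name__ == "__main__":
    # All names below are placeholders for illustration.
    gw = gateway("shared-gateway", "infra", "example-gateway-class")
    route = http_route("orders", "shop", gw, "shop.example.com",
                       "/orders", "orders-svc", 8080)
    for manifest in (gw, route):
        print(json.dumps(manifest, indent=2))
```

The split is the point: the Gateway lives in an infrastructure namespace, while each team ships its own HTTPRoute next to its Service.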
Service Mesh
A service mesh puts a proxy in the path of service-to-service traffic, traditionally as a sidecar in every pod (Envoy in Istio, linkerd2-proxy in Linkerd), to handle mTLS, retries, circuit breaking, observability, traffic shifting, and authorization. Three major options:
Istio: most feature-rich; substantial complexity. Best when you need the full toolkit.
Linkerd: simpler, performance-focused; data-plane proxy written in Rust; zero-config mTLS. Best for “mesh basics, fast”.
Cilium service mesh: eBPF-based, no sidecar, integrated with the Cilium CNI. Best when you want one tool for networking + mesh.
Service meshes implement most of the resilience patterns from Module 7 (timeouts, retries, circuit breakers) in the data plane, with no application code changes. The cost is operational complexity and the latency tax of every request passing through a proxy.
Stateful Workloads
StatefulSets give pods stable identities (predictable name, predictable network address) and stable storage (PersistentVolumeClaims that follow the pod). This is the right pattern for databases, message queues, and any workload where pod identity matters.
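StorageClasses define dynamic provisioning of PersistentVolumes from cloud-provider disks (EBS, PD, Azure Disk) or storage operators (Rook/Ceph, Longhorn). Choose the reclaim policy (Delete / Retain) and volume binding mode (Immediate / WaitForFirstConsumer) on the StorageClass, and the access mode (ReadWriteOnce / ReadWriteMany) on the PVC, deliberately. WaitForFirstConsumer delays provisioning until the pod is scheduled, so a zonal disk is created in the zone where the pod will actually run.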
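The "stable identity" a StatefulSet provides is a naming scheme that can be written down. The sketch below derives the pod name, per-pod DNS name, and PVC name for each replica. It assumes the default start ordinal of 0, a headless Service governing the set, and the default cluster domain cluster.local; the function name and example values are hypothetical.

```python
def statefulset_identities(name, service, namespace, replicas,
                           claim_template, cluster_domain="cluster.local"):
    """Return the stable names Kubernetes assigns to each StatefulSet replica."""
    identities = []
    for ordinal in range(replicas):
        pod = f"{name}-{ordinal}"
        identities.append({
            # Pod name: <statefulset>-<ordinal>
            "pod": pod,
            # Per-pod DNS record served via the headless Service.
            "dns": f"{pod}.{service}.{namespace}.svc.{cluster_domain}",
            # PVC name: <volumeClaimTemplate>-<pod>
            "pvc": f"{claim_template}-{pod}",
        })
    return identities


if __name__ == "__main__":
    for ident in statefulset_identities("postgres", "postgres", "db", 3, "data"):
        print(ident)
```

Because the PVC name is derived from the pod name, a rescheduled postgres-1 reattaches to data-postgres-1 rather than to a fresh volume.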
Autoscaling, PDBs, and Operational Sanity
HPA: scale pods on metrics. Always set minReplicas >= 2 for HA.
VPA: rightsize resource requests; clashes with HPA on the same metric.
Cluster Autoscaler / Karpenter: scale nodes. Karpenter is the modern default on AWS.
PodDisruptionBudget: cap the number of unavailable pods during voluntary disruption (drain, scale-down, eviction). Without a PDB, a node drain or autoscaler scale-down can evict every replica on the affected nodes at the same time.
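Two pieces of arithmetic sit behind this list: how the HPA picks a replica count, and how a PDB decides whether an eviction is allowed. The sketch below follows the documented HPA formula (desired = ceil(current * currentMetric / targetMetric), skipped when the ratio is within a tolerance, 0.1 by default) and a simplified integer-only version of the PDB budget. Function names and default bounds are illustrative assumptions; the real controllers also handle percentages, unready pods, and stabilization windows.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=2, max_replicas=10, tolerance=0.1):
    """Replica count the HPA would aim for, clamped to [min, max]."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def pdb_disruptions_allowed(healthy_pods, expected_pods,
                            max_unavailable=None, min_available=None):
    """How many pods may be evicted right now without violating the budget."""
    if (max_unavailable is None) == (min_available is None):
        raise ValueError("set exactly one of max_unavailable or min_available")
    if max_unavailable is not None:
        desired_healthy = expected_pods - max_unavailable
    else:
        desired_healthy = min_available
    return max(0, healthy_pods - desired_healthy)


if __name__ == "__main__":
    print(hpa_desired_replicas(4, current_metric=90, target_metric=60))
    print(pdb_disruptions_allowed(3, 3, max_unavailable=1))
    print(pdb_disruptions_allowed(2, 3, max_unavailable=1))
```

With maxUnavailable: 1 and one replica already down, the budget is zero, so a drain has to wait until the replacement pod is healthy.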
Operational Practice
The Kubernetes operational discipline:
Always run multi-AZ for production. Use topologySpreadConstraints to enforce it.
etcd: 5 nodes across 3 AZs, KMS-backed encryption at rest, tested backup/restore.
RBAC: deny by default; explicit allow per ServiceAccount; treat cluster-admin as root.
NetworkPolicy: default-deny per namespace; explicit allow rules.
PodSecurity admission: restricted profile by default, exceptions audited.
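Several items in this checklist reduce to a few lines of manifest. As a hedged sketch, the helpers below build a namespace that enforces the restricted Pod Security level, a default-deny NetworkPolicy for it, and a zone-level topologySpreadConstraints entry, all as Python dictionaries printed as JSON. The namespace name "payments" and the app label are hypothetical; the field names and label keys are the standard Kubernetes ones.

```python
import json


def restricted_namespace(name):
    """Namespace whose pods must satisfy the 'restricted' Pod Security level."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {"pod-security.kubernetes.io/enforce": "restricted"},
        },
    }


def default_deny_policy(namespace):
    """Select every pod in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def zone_spread_constraint(app_label):
    """Pod-spec entry that keeps replicas within one pod of even across zones."""
    return {
        "maxSkew": 1,
        "topologyKey": "topology.kubernetes.io/zone",
        "whenUnsatisfiable": "DoNotSchedule",
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


if __name__ == "__main__":
    for manifest in (restricted_namespace("payments"),
                     default_deny_policy("payments")):
        print(json.dumps(manifest, indent=2))
    # This fragment belongs under spec.template.spec.topologySpreadConstraints.
    print(json.dumps([zone_spread_constraint("payments-api")], indent=2))
```

Note that a default-deny egress rule also blocks DNS, so the explicit allow rules that follow it usually start with access to the cluster DNS service.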
You have a Deployment with 3 replicas, all scheduled on the same node. The cluster autoscaler scales down that node and all 3 pods get evicted at once. Why? (Answer: no PodDisruptionBudget. Define maxUnavailable: 1 so only one replica goes down at a time, and spread replicas across nodes so a single drain cannot hit them all.)
Postgres in a Deployment vs StatefulSet — what changes? (Answer: StatefulSet gives stable pod identity (postgres-0, postgres-1) and stable PVCs that follow each pod. Required for any stateful workload.)
Service mesh adds 1ms latency per hop. Across 5 hops you pay 5ms. When is it worth it? (Answer: when the mesh-provided features (mTLS, retries, observability, traffic shifting) are worth more than 5ms. For mature production systems, almost always.)
You enable Istio mesh-wide STRICT mTLS on day one of rollout. What happens? (Answer: external load balancer health probes fail; non-meshed services can no longer talk to meshed services; outage. Phase: PERMISSIVE first, observe, promote namespace by namespace.)
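The phased mTLS rollout from the last question can be expressed as an ordered list of Istio PeerAuthentication resources: a mesh-wide PERMISSIVE policy first, then a STRICT policy per namespace as each one is verified. This is a sketch under assumptions: the root namespace is taken to be istio-system, the namespace names are hypothetical, and only the two modes discussed above are modelled.

```python
import json


def peer_authentication(namespace, mode, name="default"):
    """Build a namespace-scoped (or, in the root namespace, mesh-wide) policy."""
    if mode not in ("PERMISSIVE", "STRICT"):
        raise ValueError("this sketch only models PERMISSIVE and STRICT")
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"mtls": {"mode": mode}},
    }


def phased_rollout(namespaces, root_namespace="istio-system"):
    """Step 0: mesh-wide PERMISSIVE. Steps 1..n: STRICT, one namespace at a time."""
    steps = [peer_authentication(root_namespace, "PERMISSIVE")]
    steps += [peer_authentication(ns, "STRICT") for ns in namespaces]
    return steps


if __name__ == "__main__":
    # "payments" and "orders" are placeholder namespaces.
    for step, manifest in enumerate(phased_rollout(["payments", "orders"])):
        print(f"--- step {step}")
        print(json.dumps(manifest, indent=2))
```

Each step is applied only after traffic in the previous one has been observed; a namespace-level policy takes precedence over the mesh-wide one, so the mesh default can stay PERMISSIVE until the last namespace is promoted.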
-Spotify runs over 1500 microservices on Kubernetes, supported by in-house tooling such as Backstage (developer portal) and Apollo (Java service framework).
-Pinterest migrated their entire fleet to Kubernetes over 3 years; the migration was as much a culture shift as a technology one.
-Reddit runs everything on Kubernetes after a multi-year migration from EC2.
-Google's GKE Autopilot is essentially Kubernetes with the operational complexity hidden — for teams that want the API but not the infrastructure overhead.
Production notes
Keep these close
!Always run multi-AZ. Use topologySpreadConstraints to enforce it; do not rely on luck.
!PodDisruptionBudgets are mandatory for production workloads. Without them, autoscalers will happily evict every replica.
!etcd: 5 nodes across 3 AZs, KMS-backed encryption at rest, tested backup/restore.
!Resource requests at p95 of actual usage; do not let dev defaults of “500m CPU” ship to prod.
Common mistakes
What usually breaks
!Running stateful workloads as Deployments. Use StatefulSet so PVCs follow the pod identity.
!PodSecurity admission set to “privileged” in production namespaces. Use restricted with audited exceptions.
!Ingress per service in a flat namespace. Use Gateway API with role-separated Gateway/Route for new infra.
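!Setting tight CPU limits (for example limits = requests) on latency-sensitive services. CFS throttling kicks in once the container uses up its quota for the period, even when other cores are free; latency suffers.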
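The CPU-limit item is easiest to see as arithmetic. A CPU limit becomes a CFS quota per scheduling period (100 ms by default), and every runnable thread draws from the same quota. The sketch below assumes each busy thread runs continuously on its own core; the function name is illustrative.

```python
CFS_PERIOD_MS = 100.0  # default CFS period (cpu.cfs_period_us = 100000)


def cfs_throttling(cpu_limit_cores, busy_threads, period_ms=CFS_PERIOD_MS):
    """Return (quota_ms, exhausted_after_ms, throttled_ms) for one period.

    Assumes `busy_threads` threads are runnable the whole time, each on its
    own core, so together they burn quota `busy_threads` times faster than
    wall-clock time.
    """
    quota_ms = cpu_limit_cores * period_ms
    exhausted_after_ms = min(period_ms, quota_ms / busy_threads)
    throttled_ms = period_ms - exhausted_after_ms
    return quota_ms, exhausted_after_ms, throttled_ms


if __name__ == "__main__":
    quota, used_up, stalled = cfs_throttling(cpu_limit_cores=0.5, busy_threads=4)
    print(f"quota {quota} ms, exhausted after {used_up} ms, "
          f"throttled for {stalled} ms of each period")
```

With a 500m limit and four busy threads, the 50 ms quota is gone after 12.5 ms of wall-clock time and the container stalls for the remaining 87.5 ms, regardless of how idle the node is.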
Security risks
Threats to watch
!Over-broad RBAC grants (wildcard roles, stray cluster-admin bindings) and PodSecurity left at its permissive default; the restricted profile must be set explicitly per namespace.
!Service mesh sidecars run as elevated workloads; an unverified mesh component is a cluster-wide attack vector.
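!Public LoadBalancer Services can undercut NetworkPolicy: with externalTrafficPolicy: Cluster the client source IP is rewritten, so ipBlock rules may not match as intended. Verify origin restrictions are enforced at the LB (for example with loadBalancerSourceRanges).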
!Container image pulls without signature verification accept anything from the registry. Use cosign + admission policy.
Tradeoffs
Design choices you should be able to defend
Service mesh: Istio
Pros
+Most feature-rich
+Strong community
+Rich traffic management
Cons
-Heavy operationally
-Steeper learning curve
Service mesh: Linkerd
Pros
+Simpler
+Faster (Rust)
+Zero-config mTLS
Cons
-Fewer advanced features
Service mesh: Cilium (eBPF)
Pros
+Sidecar-free
+Integrated with CNI
+Lower latency tax
Cons
-Newer; ecosystem still maturing
No mesh; just K8s primitives
Pros
+Less operational complexity
+Lower latency
Cons
-No automatic mTLS
-Resilience patterns in app code
Alternatives
Other production approaches
Self-managed Kubernetes (kubeadm, RKE2, k0s)
Full control; full operational responsibility.
EKS / GKE / AKS
Managed control plane; you operate the workloads.
GKE Autopilot / EKS Auto Mode
Managed control plane AND nodes; closest to “just deploy a Pod”.
HashiCorp Nomad
Lighter alternative; works for non-container workloads.