Module 8: Distributed Security & Zero Trust
How modern distributed systems authenticate workload-to-workload traffic: mTLS, SPIFFE/SPIRE, OPA, and the Zero Trust patterns that replace network-perimeter security.
5 hours. 3 hands-on labs. Free course module.
Learning Objectives
- Explain Zero Trust as an architectural principle, not a product
- Bootstrap mTLS between services with short-lived, automatically-rotated credentials
- Use SPIFFE/SPIRE to issue cryptographic workload identity at scale
- Enforce authorization with OPA / Rego at admission and at request time
- Federate trust across clusters and clouds without leaking secrets
Why This Matters
This is the differentiator module of this course. Most distributed-systems training treats security as a separate topic added at the end. In real production engineering, security is woven into every architectural decision: choosing between shared secrets and SPIFFE workload identity is a choice on the same scale as choosing between a monolith and microservices. Engineers who internalise this model design systems that scale and stay secure together. Engineers who do not internalise it end up retrofitting security after the first incident.
Lesson Content
The classical security model assumed a trusted internal network behind a firewall. That assumption broke the moment one application talked to another over the internet, and it broke entirely with cloud-native architectures where workloads spin up and down across clusters, regions, and clouds in seconds. Zero Trust is the response: do not trust any caller based on network position; verify identity, posture, and policy on every request.
This module is the load-bearing security wall of distributed-systems engineering. After this you should be able to design how every internal API call authenticates, authorises, and audits itself, even across cluster and cloud boundaries.
Zero Trust in One Sentence
“Never trust, always verify, assume breach.” That is the operational summary. The architectural translation: every caller has a cryptographically verifiable identity; every authorization decision uses that identity plus context; every channel is encrypted; every action is logged; and the system is designed so a compromised component does not give the attacker the keys to the kingdom.
mTLS — The Secure Channel
Mutual TLS is the foundation: both client and server present certificates and verify each other's identity. Unlike server-side TLS (where only the server is identified), mTLS gives you bidirectional cryptographic identity on every connection.
The catch: mTLS is hard at scale because of credential management. Long-lived certificates leak, get committed to git, and never rotate. Short-lived certificates require an identity issuance system. That system is what SPIFFE/SPIRE provides.
SPIFFE / SPIRE
SPIFFE (Secure Production Identity Framework For Everyone) is a CNCF specification defining a universal format for workload identity:
- SPIFFE ID: a URI like spiffe://example.com/ns/orders/sa/orders-api that uniquely names a workload.
- SVID (SPIFFE Verifiable Identity Document): a cryptographic document (X.509 certificate or JWT) that proves the holder owns the SPIFFE ID. Short-lived (minutes to an hour) and auto-rotated.
- Workload API: a Unix-socket API workloads use to fetch their current SVID. No application code touches secrets directly.
SPIRE is the reference implementation: a SPIRE Server issues SVIDs after a SPIRE Agent attests the workload via selectors (Kubernetes namespace, ServiceAccount, container image hash, etc.). The result: every workload has a unique cryptographic identity, automatically issued and rotated, with no shared secrets.
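To make the SPIFFE ID format concrete, here is a small sketch that splits an ID into its trust domain and workload path using only the Go standard library. The type and function names are hypothetical, and the validation is a simplified subset of what the SPIFFE specification requires; production code would normally rely on the go-spiffe library instead.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// SPIFFEID holds the two structural parts of a SPIFFE ID.
// Hypothetical type for illustration only.
type SPIFFEID struct {
	TrustDomain string // e.g. "example.com"
	Path        string // e.g. "/ns/orders/sa/orders-api"
}

// parseSPIFFEID applies a simplified subset of the SPIFFE ID rules:
// spiffe scheme, non-empty trust domain, no port, userinfo, query or fragment.
func parseSPIFFEID(raw string) (SPIFFEID, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return SPIFFEID{}, err
	}
	if u.Scheme != "spiffe" {
		return SPIFFEID{}, errors.New("scheme must be spiffe")
	}
	if u.Host == "" || u.Port() != "" || u.User != nil {
		return SPIFFEID{}, errors.New("trust domain must be a bare host name")
	}
	if u.RawQuery != "" || u.Fragment != "" {
		return SPIFFEID{}, errors.New("query and fragment are not allowed")
	}
	return SPIFFEID{TrustDomain: u.Host, Path: u.Path}, nil
}

func main() {
	id, err := parseSPIFFEID("spiffe://example.com/ns/orders/sa/orders-api")
	if err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println(id.TrustDomain, id.Path)
}
```

Keeping the trust domain and path as separate fields is what later makes structured authorization checks possible.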
The free Mastering SPIFFE & SPIRE course goes 13 modules deep on this topic. This module gives you the architectural picture; that course gives you the deployment.
Authorization with OPA
Authentication answers “who is calling?”; authorization answers “is this caller allowed to do this?”. OPA (Open Policy Agent) is the CNCF-graduated policy engine that lets you express authz rules as code (in the Rego language) and evaluate them at admission time (Kubernetes admission webhook, Gatekeeper) or at request time (Envoy ext_authz, application middleware).
Sample Rego rule: “a workload from spiffe://example.com/ns/billing/sa/charger may call POST /charges if its tenant_id matches the charge's tenant_id”. The rule lives in version control, runs in CI, ships independently of application code.
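The decision logic of that sample rule can be sketched in Go to show its shape. This is a stand-in for illustration, not Rego and not OPA's API; the Request struct and its field names are assumptions about what the policy input might carry.

```go
package main

import "fmt"

// Request is a hypothetical policy input: who is calling, what they are
// calling, and the tenant on each side of the charge.
type Request struct {
	CallerSPIFFEID string
	Method         string
	Path           string
	CallerTenantID string
	ChargeTenantID string
}

const chargerID = "spiffe://example.com/ns/billing/sa/charger"

// allow is default-deny: it returns true only when every condition of the
// sample rule holds, and false for anything it does not recognise.
func allow(r Request) bool {
	return r.CallerSPIFFEID == chargerID &&
		r.Method == "POST" &&
		r.Path == "/charges" &&
		r.CallerTenantID != "" &&
		r.CallerTenantID == r.ChargeTenantID
}

func main() {
	ok := Request{
		CallerSPIFFEID: chargerID,
		Method:         "POST",
		Path:           "/charges",
		CallerTenantID: "t-42",
		ChargeTenantID: "t-42",
	}
	crossTenant := ok
	crossTenant.ChargeTenantID = "t-99"

	fmt.Println(allow(ok))          // true
	fmt.Println(allow(crossTenant)) // false
}
```

In OPA the same conditions would live in a Rego rule evaluated outside the application, which is what lets the policy ship independently of service code.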
API Security
For external API security — user authentication, token formats, OAuth, JWT — the patterns are different. Module 9 of the Cloud Native Security Engineering course covers these. The API Attack & Defense Simulator is the hands-on exercise.
For service-to-service inside your infrastructure: SPIFFE workload identity + mTLS + OPA authz is the production architecture. For human-to-API: OAuth + JWT + scope-based policy. The two patterns coexist; do not blur them.
Federation Across Trust Domains
Multi-cluster and multi-cloud distributed systems need workloads in one cluster to authenticate workloads in another. SPIFFE federation is the mechanism: each trust domain (cluster) exposes its trust bundle via a bundle endpoint; federated peers fetch and trust each other's bundles. SVIDs issued in one cluster are verifiable in another.
This is how you build cross-cluster service-to-service security without VPNs, shared secrets, or per-cluster identity sprawl. The Zero Trust Network Builder simulator walks through SPIFFE federation scenarios in production form.
Operational Practice
- Issue SVIDs valid for 1 hour or less; rotate automatically; never let credentials accumulate validity beyond what an attacker could exploit.
- Log every allow/deny authorization decision with the principal's SPIFFE ID; this is your audit trail.
- Default-deny at the policy layer; explicit allow rules for known patterns; everything else rejected.
- Treat the workload identity provider (SPIRE) as a tier-0 dependency; HA cluster, backups, tested restoration.
mTLS Handshake Sequence
Self-Check Quiz
- You issue SVIDs valid for 24 hours. The security team objects. Why? (Answer: the longer the validity, the larger the blast radius if a credential leaks. SPIRE's default X.509-SVID lifetime is 1 hour, with renewal at about the half-life (30 minutes). Short-lived = self-healing.)
- Your OPA policy denies a request. The application returns 500. What is wrong? (Answer: should return 403. 500 is “something broke”; 403 is “policy denied”. The distinction matters for triage.)
- How do you authorise “only orders-service can call payments-service” in OPA? (Answer: input.peer.spiffe_id == "spiffe://example.com/ns/orders/sa/orders-svc", or use a path-prefix match for groups of allowed callers.)
- SPIFFE federation between two clusters fails 24 hours after rotation. What happened? (Answer: stale trust bundle. The federation peer needs to refresh from the bundle endpoint regularly. Static bundle copies always fail this way.)
- Your service mesh (Istio) provides automatic mTLS. Do you still need SPIFFE? (Answer: Istio uses SPIFFE-style identity internally; explicit SPIFFE/SPIRE is needed for non-mesh workloads, federation across clusters, or richer authz.)
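The path-prefix match mentioned in the third answer needs to respect segment boundaries. Here is a hedged Go sketch (hasPathPrefix is a hypothetical helper, and the spoofed ID is made up for the demonstration) that matches whole path segments and shows why a bare substring test is not a substitute.

```go
package main

import (
	"fmt"
	"strings"
)

// hasPathPrefix reports whether id sits under prefix, matching whole path
// segments only: the prefix must be followed by "/" in the ID.
func hasPathPrefix(id, prefix string) bool {
	return strings.HasPrefix(id, strings.TrimSuffix(prefix, "/")+"/")
}

func main() {
	const ordersNS = "spiffe://example.com/ns/orders"

	good := "spiffe://example.com/ns/orders/sa/orders-svc"
	// Hypothetical attacker-controlled namespace whose name merely
	// starts with "orders".
	spoof := "spiffe://example.com/ns/orders-test/sa/anything"

	fmt.Println(hasPathPrefix(good, ordersNS))      // true
	fmt.Println(hasPathPrefix(spoof, ordersNS))     // false
	fmt.Println(strings.Contains(spoof, "orders"))  // true: substring check is fooled
}
```

Because the prefix includes the trust domain, an ID from a different trust domain with the same path is rejected as well.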
For implementation depth, take the free Mastering SPIFFE & SPIRE course. Reference the glossary on key primitives: SPIFFE, SPIRE, SVID, mTLS, workload identity, Zero Trust, OPA, and service mesh. The SPIFFE/SPIRE cheatsheet, OPA / Rego cheatsheet, and API Security cheatsheet are the operational quick references. Practice with the Zero Trust Network Builder.
Real-World Use Cases
- Bloomberg, Pinterest, Anthem, and Yahoo all run SPIRE in production for service identity at scale.
- Netflix uses an internal SPIFFE-style identity system across thousands of services.
- Most service meshes (Istio, Linkerd) implement SPIFFE-style identity internally even when not labelled as such.
- Open Policy Agent powers Kubernetes admission control for thousands of organisations via Gatekeeper.
Production Notes
- Issue SVIDs valid for 1 hour or less; rotate automatically. Long-lived credentials are accumulated risk.
- Default-deny at the policy layer; explicit allow rules; everything else rejected.
- Treat SPIRE Server as tier-0: HA, KMS-backed encryption at rest, tested restoration runbook.
- Log every authz decision with the principal's SPIFFE ID. That log is your audit trail.
Common Mistakes
- Long-lived (24h+) certificates as a “safety margin”. The opposite is true — longer = larger blast radius if leaked.
- Returning HTTP 500 when OPA denies a request, instead of 403. Triage gets confused; production stays on fire.
- Substring matching on SPIFFE IDs (strings.Contains(id, "orders")) instead of structured comparison. Trivial to bypass.
- Static trust-bundle copies for federation. Become stale at the next CA rotation.
Security Risks to Watch
- Long-lived shared secrets are accumulating risk. Every leak compounds.
- OPA policies are code — they need code review, CI, version control. Untested Rego is worse than no policy.
- SPIFFE federation across mutually-untrusted clusters requires careful trust-bundle handling. Static copies go stale at the next CA rotation and can keep trusting keys the peer has already retired.
- The workload identity provider becomes the most-attacked component. Treat its operational hardening like the database tier.
Design Tradeoffs
Service-mesh-managed mTLS (Istio, Linkerd)
Pros
- Zero application changes
- Automatic rotation
- Policy via mesh CRDs
Cons
- Sidecar latency
- Mesh operational complexity
SPIFFE/SPIRE direct integration
Pros
- Works for non-mesh workloads
- Cross-cluster federation
- Richer authz options
Cons
- Application code changes
- Operate SPIRE
Long-lived secrets + manual rotation
Pros
- No new infra
Cons
- Accumulating risk
- Manual rotation always lags
- Wide blast radius on leak
Production Alternatives
- SPIFFE / SPIRE (vendor-neutral): CNCF spec + reference implementation; works for non-mesh workloads, federations, K8s-or-not.
- Istio mesh-managed identity: SPIFFE-style identity hidden inside Istio; simpler if you already run Istio.
- Linkerd identity: Built-in mTLS using Linkerd's identity service; simplest mesh option.
- AWS IAM Roles Anywhere / GCP Workload Identity Federation: Cloud-native identity for workloads outside Kubernetes; less portable.
- Vault PKI engine: HashiCorp Vault as a CA for short-lived certs; works without SPIFFE conventions.
Think Like an Engineer
- Treat workload identity as your most-attacked component. Operate it with the rigor you give the database tier.
- For every service-to-service call ask: who is calling, with what identity, against what policy, and where is the audit log?
- Authorization rules in version-controlled code (Rego) beat authorization rules buried in service code; CI catches regressions.
Production Story
A platform team rolled out service-to-service mTLS using a corporate CA, certificates valid for 1 year, mounted as Kubernetes Secrets. A leaked etcd backup six months later contained every cert + private key. Rotation across 200 services took 3 weeks of coordinated change windows. The team migrated to SPIFFE/SPIRE with 1-hour SVIDs; the next leak (a compromised CI runner) had a 1-hour exposure window instead of months.
Career Relevance
Workload identity and Zero Trust are the most leveraged security skills in cloud-native engineering. Engineers fluent in SPIFFE/SPIRE, mTLS, and OPA get pulled into platform-engineering and security-engineering roles. Companies hiring senior platform engineers test these specifically.
Key Terms
- Zero Trust: Security model that drops the assumption of a trusted internal network; verifies every request.
- mTLS: Mutual TLS; both client and server authenticate via certificates.
- SPIFFE: CNCF spec defining a universal workload identity format (SPIFFE ID + SVID).
- SPIRE: CNCF reference implementation of SPIFFE; issues SVIDs after attesting workloads.
- OPA: Open Policy Agent; CNCF policy engine for declarative authorization in Rego.
Hands-On Labs
Lab 8.1 — mTLS Between Two Services with SPIFFE
Deploy two services on Kubernetes; bootstrap mTLS using SPIRE-issued SVIDs.
120 minutes - Intermediate
- Install SPIRE on kind cluster
- Register workloads with SPIRE selectors
- Implement mTLS server using go-spiffe
- Verify peer identity on every connection
Lab 8.2 — OPA Authorization at Envoy
Add OPA ext_authz to Envoy; enforce SPIFFE-ID-based access policy.
90 minutes - Advanced
- Deploy Envoy + OPA sidecar pattern
- Write Rego policy: only orders-api can call payments-api
- Send authorized and unauthorized calls; verify deny path
Lab 8.3 — SPIFFE Federation Across Two Clusters
Stand up two kind clusters; federate trust; have a workload in cluster A authenticate to a workload in cluster B.
120 minutes - Advanced
- Stand up two kind clusters
- Install SPIRE in each with distinct trust domains
- Configure bundle endpoint exchange
- Cross-cluster mTLS verified by SPIFFE ID
Key Takeaways
- Zero Trust is an architectural principle: never trust caller location, always verify identity
- mTLS gives bidirectional cryptographic identity; SPIFFE/SPIRE makes it scalable
- Workload identity replaces shared secrets and long-lived credentials
- OPA / Rego puts authorization policy into version control and CI
- Federation extends Zero Trust across clusters and clouds without identity sprawl