Module 9: Advanced SPIRE Architectures
Production-grade deployments: HA, federation, and multi-cluster
3.5 hours. 2 hands-on labs. Free course module.
Learning Objectives
- Design high-availability SPIRE deployments
- Configure nested SPIRE for hierarchical trust
- Implement SPIFFE federation across trust domains
- Plan multi-cluster and multi-cloud architectures
Why This Matters
Single-server SPIRE works for demos. Production requires high availability, multi-cluster federation, and disaster recovery. This module teaches you the architecture patterns that organizations with thousands of services deploy.
Lesson Content
This lesson walks through the three patterns that large organizations combine in production: high availability for a single trust domain, nested SPIRE for hierarchical deployments, and federation across trust domains. It then covers multi-cloud layout, incremental migration, and failure scenarios.
High Availability SPIRE
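A single SPIRE Server is a single point of failure. HA SPIRE runs multiple server replicas against a shared database (PostgreSQL or MySQL), with a load balancer in front of the server API. SPIRE does not elect a leader for CA operations: each replica holds its own signing key and publishes its CA certificate to the common trust bundle in the shared datastore, so any replica can sign SVIDs.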
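# HA SPIRE Server configuration (every replica uses the same trust domain and datastore):
server {
    bind_address = "0.0.0.0"
    bind_port = "8081"
    trust_domain = "example.org"
    # Local working directory for this replica; this is not the shared datastore
    data_dir = "/run/spire/data"
}
plugins {
    # Shared datastore: all replicas point at the same database (not SQLite!)
    DataStore "sql" {
        plugin_data {
            database_type = "postgres"
            connection_string = "host=db.internal dbname=spire sslmode=verify-full"
        }
    }
    # Other required plugins (KeyManager, NodeAttestor) are omitted here for brevity
}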
Nested SPIRE
Nested SPIRE creates a hierarchy where a child SPIRE Server gets its CA certificate from a parent SPIRE Server. This is useful for multi-team deployments where each team manages their own SPIRE Server, organizations with multiple environments (dev/staging/prod), and compliance requirements that mandate separate CA hierarchies.
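To make the hierarchy concrete, here is a minimal sketch of why a verifier in a nested deployment only needs the parent's root: every workload SVID chains through its child server's CA up to that root. This is a toy model that compares names only and checks no signatures; the Cert class, the CA names, and the SPIFFE ID paths are hypothetical and are not part of SPIRE.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Cert:
    subject: str  # who this certificate identifies
    issuer: str   # subject of the CA that signed it


def chains_to_root(leaf: Cert, cas: Dict[str, Cert], roots: Set[str],
                   max_depth: int = 8) -> bool:
    """Follow issuer links from the leaf until a trusted root is reached.

    Toy model: only names are compared, no signatures are verified.
    """
    current = leaf
    for _ in range(max_depth):
        if current.issuer in roots:
            return True
        parent = cas.get(current.issuer)
        if parent is None:
            return False  # issuer is neither a root nor a known child CA
        current = parent
    return False  # chain too long; treat as untrusted


# Hypothetical hierarchy: one parent server, one child server per team.
roots = {"parent-ca"}
cas = {
    "team-a-ca": Cert("team-a-ca", "parent-ca"),
    "team-b-ca": Cert("team-b-ca", "parent-ca"),
}
svid_a = Cert("spiffe://example.org/team-a/api", "team-a-ca")
rogue = Cert("spiffe://example.org/team-a/api", "unknown-ca")

print(chains_to_root(svid_a, cas, roots))  # True
print(chains_to_root(rogue, cas, roots))   # False
```

Because both team CAs chain to the same root, workloads under different child servers stay in one trust domain and can verify each other without any bundle exchange.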
SPIFFE Federation
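Federation allows workloads in different trust domains to verify each other’s identities. Each SPIRE Server shares its trust bundle with the other, enabling cross-domain mTLS. A workload receives the remote bundle only if its registration entry is marked as federating with that trust domain (the -federatesWith flag on spire-server entry create).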
# Server A configuration (exposes a bundle endpoint so cluster-b can fetch cluster-a's bundle):
server {
    trust_domain = "cluster-a.company.org"
    federation {
        bundle_endpoint {
            address = "0.0.0.0"
            port = 8443
        }
    }
}
# Import cluster-b's trust bundle on Server A:
spire-server bundle set -id spiffe://cluster-b.company.org \
    -path /path/to/cluster-b-bundle.pem
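Once bundles are exchanged, a verifier has to pick the right bundle for each peer: the trust domain in the peer's SPIFFE ID decides which set of CA certificates applies. The following is a minimal sketch of that lookup only, not of certificate validation. The bundle values are placeholder strings and the workload paths are hypothetical.

```python
from typing import Dict, Optional
from urllib.parse import urlparse


def trust_domain_of(spiffe_id: str) -> str:
    """Return the trust domain (the URI authority) of a SPIFFE ID."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    return parsed.netloc


def select_bundle(peer_id: str, bundles: Dict[str, str]) -> Optional[str]:
    """Pick the bundle for the peer's trust domain, or None if not federated."""
    return bundles.get(trust_domain_of(peer_id))


# Placeholder bundle contents; a real bundle holds the domain's CA certificates.
bundles = {
    "cluster-a.company.org": "<cluster-a CA certificates>",
    "cluster-b.company.org": "<cluster-b CA certificates>",
}

print(select_bundle("spiffe://cluster-b.company.org/ns/prod/sa/api", bundles))
print(select_bundle("spiffe://cluster-c.company.org/ns/prod/sa/api", bundles))
```

A peer from a domain with no imported bundle yields None, which a verifier should treat as a failed handshake rather than falling back to some other bundle.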
Multi-Cloud Architectures
SPIRE works across AWS, GCP, Azure, and on-premise because identity is based on attestation, not cloud-specific constructs. Each environment has its own attestation plugins but all participate in the same trust domain (or federate across domains).
Migration Strategy: Adopting SPIFFE Incrementally
Most companies cannot switch to SPIFFE overnight. The proven migration path:
- Phase 1 — Deploy SPIRE alongside existing identity: Run SPIRE in parallel without changing any service. Just get SVIDs flowing.
- Phase 2 — Enable mTLS on one critical path: Pick one service-to-service connection (e.g., API → database proxy). Add SPIRE-based mTLS. Keep the old auth as fallback.
- Phase 3 — Expand incrementally: Service by service, switch from shared secrets to SVID-based authentication. Each switch is independent and reversible.
- Phase 4 — Remove legacy auth: Once all services use SVIDs, remove the old shared secrets, API keys, and static certificates.
- Phase 5 — Add authorization: Deploy OPA policies for fine-grained access control on top of the identity layer.
Key principle: coexistence, not replacement. SPIRE can run alongside existing PKI, Vault, and service mesh CAs during migration. You do not need to rip and replace.
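The "keep the old auth as fallback" step in Phase 2 can be sketched as a small authentication function that prefers the SVID identity and only falls back to a legacy API key when no SVID was presented. This is an illustrative sketch under stated assumptions: the Request class, the field names, and the example IDs and keys are hypothetical, and the TLS layer is assumed to have already verified the peer's SVID before setting peer_spiffe_id.

```python
import hmac
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Request:
    # Set by the TLS layer only after the peer's SVID was verified (assumption).
    peer_spiffe_id: Optional[str] = None
    # Legacy shared secret sent by callers that have not migrated yet.
    api_key: Optional[str] = None


def authenticate(req: Request, allowed_ids: Set[str],
                 legacy_keys: Dict[str, str]) -> Optional[str]:
    """Return a label for the authenticated caller, or None to reject."""
    if req.peer_spiffe_id is not None:
        # An SVID was presented: never fall back to the legacy path.
        if req.peer_spiffe_id in allowed_ids:
            return f"svid:{req.peer_spiffe_id}"
        return None
    if req.api_key is not None:
        for name, key in legacy_keys.items():
            # Constant-time comparison of the shared secret.
            if hmac.compare_digest(key.encode(), req.api_key.encode()):
                return f"legacy:{name}"
    return None


allowed_ids = {"spiffe://example.org/api"}
legacy_keys = {"batch-job": "s3cr3t"}

print(authenticate(Request(peer_spiffe_id="spiffe://example.org/api"),
                   allowed_ids, legacy_keys))   # svid:spiffe://example.org/api
print(authenticate(Request(api_key="s3cr3t"), allowed_ids, legacy_keys))  # legacy:batch-job
print(authenticate(Request(api_key="wrong"), allowed_ids, legacy_keys))   # None
```

Returning a label that distinguishes the two paths makes it easy to count how many callers still use legacy credentials, which is the signal for when Phase 4 is safe to start.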
Incident Thinking: What Happens If...
- SPIRE Server fails? Agents cache SVIDs locally. Existing workloads continue with cached certificates until TTL expires. New workloads cannot get SVIDs until the server recovers. This is why HA is critical.
- Datastore becomes unavailable? Server cannot create or modify registration entries but continues serving cached entries. Recovery requires datastore restoration.
- Trust bundle expires? All SVID verification fails across the trust domain. This is a catastrophic event — monitor CA TTL and rotate well before expiry.
- Federation breaks? Cross-cluster communication fails but intra-cluster communication continues. Each trust domain is independent.
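- Compromised agent obtains rogue SVIDs? Agents do not sign SVIDs themselves; a compromised agent can only obtain SVIDs for workloads registered to its node, so the blast radius is limited to that node. Evict or ban the agent (spire-server agent evict or spire-server agent ban) so it can no longer obtain new ones; SVIDs it already holds remain valid until they expire.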
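Two of the scenarios above come down to simple arithmetic: how long cached SVIDs survive a server outage, and when to alert on an approaching CA expiry. The sketch below assumes that agents renew an SVID once half of its lifetime has elapsed and that rotation was healthy before the outage; the one-hour TTL and the 30-day warning window are illustrative values, not recommendations from this module.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def outage_window(svid_ttl: timedelta,
                  renew_fraction: float = 0.5) -> Tuple[timedelta, timedelta]:
    """Return (worst case, best case) remaining SVID lifetime at server failure.

    Assumes each SVID is renewed once `renew_fraction` of its TTL has elapsed,
    so at any moment an SVID is at most that far into its lifetime.
    """
    worst = svid_ttl * (1 - renew_fraction)  # SVID was just about to be renewed
    best = svid_ttl                          # SVID was renewed a moment ago
    return worst, best


def ca_needs_rotation(not_after: datetime, now: datetime,
                      warn_before: timedelta) -> bool:
    """True once the CA certificate is within `warn_before` of expiring."""
    return not_after - now <= warn_before


worst, best = outage_window(timedelta(hours=1))
print(worst, best)  # 0:30:00 1:00:00

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
ca_expiry = datetime(2025, 1, 20, tzinfo=timezone.utc)
print(ca_needs_rotation(ca_expiry, now, timedelta(days=30)))  # True
```

Under these assumptions, a one-hour SVID TTL means some workloads start failing 30 minutes into an outage, which gives a concrete recovery-time target for the HA design.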
Real-World Use Cases
- Multi-cluster Kubernetes with unified trust
- Multi-cloud deployments (AWS + GCP) sharing workload identity
- Organizational mergers where separate trust domains need to federate
- Disaster recovery where a standby SPIRE Server takes over automatically
Common Mistakes
- Running SPIRE Server with SQLite in production (no HA support)
- Not planning federation before deploying to multiple clusters
- Using different trust domains for dev/staging/prod when they need to communicate
- Not testing failover before you need it in an actual outage
Security Risks to Watch
- Federation trusts every identity issued by the remote trust domain, so authorize specific SPIFFE IDs and federate only the registration entries that need it
- Compromised SPIRE Server in an HA setup can issue rogue SVIDs until detected
- Shared PostgreSQL datastore between replicas is a single point of compromise
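The first risk above can be reduced by never treating "signed by a federated domain" as sufficient on its own. A minimal sketch of a deny-by-default check for federated peers follows; the allow-list contents and workload paths are hypothetical, and the peer's SVID is assumed to have been cryptographically verified already.

```python
from typing import Dict, FrozenSet
from urllib.parse import urlparse

LOCAL_DOMAIN = "cluster-a.company.org"

# Exact SPIFFE IDs accepted from each federated domain (hypothetical entries).
FEDERATED_ALLOW: Dict[str, FrozenSet[str]] = {
    "cluster-b.company.org": frozenset({
        "spiffe://cluster-b.company.org/ns/payments/sa/api",
    }),
}


def authorize_peer(peer_id: str,
                   local_domain: str = LOCAL_DOMAIN,
                   allow: Dict[str, FrozenSet[str]] = FEDERATED_ALLOW) -> bool:
    """Authorize an already-verified peer identity.

    Local peers are passed through to local policy (not modeled here);
    federated peers must match an explicit allow-list entry.
    """
    parsed = urlparse(peer_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        return False
    if parsed.netloc == local_domain:
        return True
    return peer_id in allow.get(parsed.netloc, frozenset())


print(authorize_peer("spiffe://cluster-b.company.org/ns/payments/sa/api"))  # True
print(authorize_peer("spiffe://cluster-b.company.org/ns/dev/sa/debug"))     # False
```

Exact-match entries are used rather than path prefixes so that a new workload in the remote domain gains no access until someone adds it deliberately.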
Think Like an Engineer
- Should federation be centralized (hub-and-spoke) or decentralized (mesh)?
- When should you use nested SPIRE vs federated SPIRE?
- How do you handle trust domain migration without downtime?
- What is the blast radius if one SPIRE Server is compromised?
Production Story
An e-commerce platform expanded from one Kubernetes cluster to three across two cloud providers. Initially, each cluster had its own identity system — separate Vault instances, separate certificates. Cross-cluster communication required manual certificate exchange. After deploying federated SPIRE, services in any cluster could verify identities from any other cluster automatically. The federation bundle exchange took 15 minutes to configure.
Hands-On Labs
- Deploying SPIRE in HA Mode: Deploy a 3-replica SPIRE Server with PostgreSQL.
  - Deploy PostgreSQL for the SPIRE datastore
  - Deploy 3 SPIRE Server replicas
  - Verify that every replica serves requests and that failover works
  - Simulate a server failure and observe recovery
- Configuring SPIFFE Federation: Federate two SPIRE deployments for cross-cluster trust.
  - Deploy two separate SPIRE instances (two Kind clusters)
  - Exchange trust bundles between the instances
  - Register federated workload entries
  - Verify cross-cluster mTLS communication
Key Takeaways
- Production SPIRE requires HA with shared database (PostgreSQL/MySQL)
- Nested SPIRE enables hierarchical trust for multi-team organizations
- Federation allows cross-trust-domain authentication via bundle exchange
- Multi-cloud works because attestation is plugin-based, not cloud-specific
- Plan trust domain boundaries early — they are hard to change later