Module 9 of 13

Advanced SPIRE Architectures

Production-grade deployments: HA, federation, and multi-cluster

3.5 hours · 2 labs · Free

Start here

Learning objectives

Design high-availability SPIRE deployments
Configure nested SPIRE for hierarchical trust
Implement SPIFFE federation across trust domains
Plan multi-cluster and multi-cloud architectures

Single-server SPIRE works for development. Production requires high availability, multi-cluster trust, and sometimes hierarchical deployments. This module covers the advanced architectures that large organizations deploy.

High Availability SPIRE

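A single SPIRE Server is a single point of failure. HA SPIRE runs multiple server replicas against a shared database (PostgreSQL or MySQL), with a load balancer in front of the server API. SPIRE does not elect a leader for CA operations: every replica is active, holds its own signing key, and publishes its CA certificate to the trust bundle in the shared datastore, so an SVID signed by any replica validates across the trust domain.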

# HA SPIRE Server configuration:
server {
    bind_address = "0.0.0.0"
    bind_port = "8081"
    trust_domain = "example.org"

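    # Local working directory for this replica; shared state lives in the datastore below
    data_dir = "/run/spire/data"
}

plugins {
    # Shared datastore (not SQLite!): every replica points at the same database
    DataStore "sql" {
        plugin_data {
            database_type = "postgres"
            # Add user/password (and sslrootcert if needed) for your environment
            connection_string = "host=db.internal dbname=spire sslmode=verify-full"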
        }
    }
}
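
A quick pre-deployment check can catch a replica that is still pointed at SQLite. The sketch below is a minimal example in Python, not an official SPIRE tool: the function names are hypothetical, and it uses a simple regular expression rather than a real HCL parser, so treat it as a starting point only.

```python
import re
from typing import Optional

# Matches an uncommented database_type setting in a SPIRE server config.
_DB_TYPE = re.compile(r'^\s*database_type\s*=\s*"([^"]+)"', re.MULTILINE)


def datastore_type(config_text: str) -> Optional[str]:
    """Return the configured database_type, or None if it is not set."""
    match = _DB_TYPE.search(config_text)
    return match.group(1) if match else None


def is_ha_ready(config_text: str) -> bool:
    """True only when the datastore is a shared SQL backend."""
    return datastore_type(config_text) in {"postgres", "mysql"}


sample = '''
plugins {
    DataStore "sql" {
        plugin_data {
            database_type = "postgres"
        }
    }
}
'''
print(is_ha_ready(sample))  # True
```

Running a check like this in CI for every replica's config is one way to make sure all servers share the same datastore before traffic reaches them.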

Nested SPIRE

Nested SPIRE creates a hierarchy where a child SPIRE Server gets its CA certificate from a parent SPIRE Server. This is useful for multi-team deployments where each team manages their own SPIRE Server, organizations with multiple environments (dev/staging/prod), and compliance requirements that mandate separate CA hierarchies.
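
To picture the hierarchy, the following Python sketch walks from a child server's CA up to the root. It is a toy model only: the server names and the issued_by mapping are hypothetical, and real chain validation is done by X.509 libraries, not by name lookups.

```python
from typing import Dict, List


def chain_to_root(ca_name: str, issued_by: Dict[str, str]) -> List[str]:
    """Walk from a CA up to the root, returning every CA on the path."""
    chain = [ca_name]
    while chain[-1] in issued_by:
        parent = issued_by[chain[-1]]
        if parent in chain:
            raise ValueError(f"cycle in CA hierarchy at {parent}")
        chain.append(parent)
    return chain


# Hypothetical hierarchy: one root server with two team servers below it.
issued_by = {
    "payments-server": "root-server",
    "search-server": "root-server",
}
print(chain_to_root("payments-server", issued_by))
# ['payments-server', 'root-server']
```

Both team servers end at the same root, which is why workloads under either one can validate each other without federation: nested servers stay inside one trust domain.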

SPIFFE Federation

Federation allows workloads in different trust domains to verify each other’s identities. Each SPIRE Server shares its trust bundle with the other, enabling cross-domain mTLS.

# Server A configuration (exposes its bundle endpoint so cluster-b can fetch Server A's bundle):
server {
    trust_domain = "cluster-a.company.org"
    federation {
        bundle_endpoint {
            address = "0.0.0.0"
            port = 8443
        }
    }
}

# On Server A, import cluster-b's trust bundle:
spire-server bundle set -id spiffe://cluster-b.company.org \
  -path /path/to/cluster-b-bundle.pem
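
To make the bundle exchange concrete, here is a minimal Python sketch of how a verifier might choose which trust bundle to validate a peer against: it reads the trust domain from the peer's SPIFFE ID and looks up the matching bundle. The function names and bundle paths are hypothetical; in practice the Workload API and a SPIFFE library handle this step.

```python
from typing import Dict
from urllib.parse import urlparse


def trust_domain_of(spiffe_id: str) -> str:
    """Extract the trust domain (the URI authority) from a SPIFFE ID."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    return parsed.netloc


def bundle_for(spiffe_id: str, bundles: Dict[str, str]) -> str:
    """Return the bundle registered for the peer's trust domain."""
    domain = trust_domain_of(spiffe_id)
    if domain not in bundles:
        raise LookupError(f"no bundle for trust domain {domain}")
    return bundles[domain]


# Hypothetical bundle store on a cluster-a workload: its own bundle plus
# the federated bundle imported from cluster-b.
bundles = {
    "cluster-a.company.org": "/path/to/cluster-a-bundle.pem",
    "cluster-b.company.org": "/path/to/cluster-b-bundle.pem",
}
peer = "spiffe://cluster-b.company.org/ns/default/sa/api"
print(bundle_for(peer, bundles))  # /path/to/cluster-b-bundle.pem
```

A peer from a trust domain with no imported bundle raises LookupError, which mirrors what happens when federation has not been set up: the handshake cannot be verified.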

Multi-Cloud Architectures

SPIRE works across AWS, GCP, Azure, and on-premises environments because identity is based on attestation, not cloud-specific constructs. Each environment uses its own attestation plugins, but all of them either participate in the same trust domain or federate across domains.

Migration Strategy: Adopting SPIFFE Incrementally

Most companies cannot switch to SPIFFE overnight. The proven migration path:

Phase 1 - Deploy SPIRE alongside existing identity: Run SPIRE in parallel without changing any service. Just get SVIDs flowing.
Phase 2 - Enable mTLS on one critical path: Pick one service-to-service connection (e.g., API → database proxy). Add SPIRE-based mTLS. Keep the old auth as fallback.
Phase 3 - Expand incrementally: Service by service, switch from shared secrets to SVID-based authentication. Each switch is independent and reversible.
Phase 4 - Remove legacy auth: Once all services use SVIDs, remove the old shared secrets, API keys, and static certificates.
Phase 5 - Add authorization: Deploy OPA policies for fine-grained access control on top of the identity layer.

Key principle: coexistence, not replacement. SPIRE can run alongside existing PKI, Vault, and service mesh CAs during migration. You do not need to rip and replace.
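
The "keep the old auth as fallback" step in Phase 2 can be sketched as a dual-path check. This is a minimal Python illustration under stated assumptions: peer_spiffe_id is assumed to come from an mTLS handshake that has already been verified, and the function name, the allowlist, and the legacy key table are all hypothetical.

```python
import hmac
from typing import Dict, Optional, Set, Tuple


def authenticate(
    peer_spiffe_id: Optional[str],
    api_key: Optional[str],
    allowed_ids: Set[str],
    legacy_keys: Dict[str, str],
) -> Tuple[str, str]:
    """Return (caller, method). Prefer the SVID; fall back to the legacy key."""
    if peer_spiffe_id is not None:
        if peer_spiffe_id in allowed_ids:
            return peer_spiffe_id, "svid"
        # An SVID was presented but is not allowed: reject, never downgrade.
        raise PermissionError(f"SPIFFE ID not allowed: {peer_spiffe_id}")
    if api_key is not None:
        for caller, expected in legacy_keys.items():
            # Constant-time comparison of the shared secret.
            if hmac.compare_digest(api_key.encode(), expected.encode()):
                return caller, "legacy"
    raise PermissionError("no valid credential presented")


allowed = {"spiffe://example.org/api"}
legacy = {"api-service": "old-shared-secret"}
print(authenticate("spiffe://example.org/api", None, allowed, legacy))
print(authenticate(None, "old-shared-secret", allowed, legacy))
```

Returning the method alongside the caller makes it possible to count how many requests still arrive on the legacy path, which is a useful signal for deciding when Phase 4 is safe.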

Incident Thinking: What Happens If...

SPIRE Server fails? Agents cache SVIDs locally. Existing workloads continue with cached certificates until TTL expires. New workloads cannot get SVIDs until the server recovers. This is why HA is critical.
Datastore becomes unavailable? Server cannot create or modify registration entries but continues serving cached entries. Recovery requires datastore restoration.
Trust bundle expires? All SVID verification fails across the trust domain. This is a catastrophic event - monitor CA TTL and rotate well before expiry.
Federation breaks? Cross-cluster communication fails but intra-cluster communication continues. Each trust domain is independent.
Compromised agent requests rogue SVIDs? Agents do not sign SVIDs; the server does. A compromised agent can only obtain SVIDs for workloads registered under its own node, so the blast radius is limited to that node. Ban the agent (spire-server agent ban) so it cannot re-attest.
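
For the first scenario, it helps to estimate how long workloads survive a server outage. The Python sketch below assumes that agents renew an SVID once a fixed fraction of its lifetime has elapsed (half, in this example) and uses a one-hour TTL; both values are assumptions to replace with your own settings, and the function name is hypothetical.

```python
from datetime import timedelta
from typing import Tuple


def outage_tolerance(
    svid_ttl: timedelta, renew_fraction: float = 0.5
) -> Tuple[timedelta, timedelta]:
    """Return (worst_case, best_case) remaining SVID validity at outage start.

    Worst case: the SVID was just about to be renewed when the server failed.
    Best case: the SVID had just been issued.
    """
    if not 0 < renew_fraction < 1:
        raise ValueError("renew_fraction must be between 0 and 1")
    worst = svid_ttl * (1 - renew_fraction)
    best = svid_ttl
    return worst, best


worst, best = outage_tolerance(timedelta(hours=1))
print(worst, best)  # 0:30:00 1:00:00
```

Under these assumptions, an outage has to be resolved within the worst-case window before the first workloads start failing, which puts a concrete number on the recovery-time objective for the server tier.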

Real world

Where this shows up

Multi-cluster Kubernetes with unified trust
Multi-cloud deployments (AWS + GCP) sharing workload identity
Organizational mergers where separate trust domains need to federate
Disaster recovery where a standby SPIRE Server takes over automatically

Common mistakes

What usually breaks

Running SPIRE Server with SQLite in production (no HA support)
Not planning federation before deploying to multiple clusters
Using different trust domains for dev/staging/prod when they need to communicate
Not testing failover before you need it in an actual outage

Security risks

Threats to watch

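Federation lets local workloads validate any identity in the remote trust domain - authorize specific SPIFFE IDs rather than the whole domain, and add the federated bundle only to the registration entries that need it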
Compromised SPIRE Server in an HA setup can issue rogue SVIDs until detected
Shared PostgreSQL datastore between replicas is a single point of compromise

Think like an engineer

Questions to answer before shipping

Should federation be centralized (hub-and-spoke) or decentralized (mesh)?
When should you use nested SPIRE vs federated SPIRE?
How do you handle trust domain migration without downtime?
What is the blast radius if one SPIRE Server is compromised?

Labs

Hands-on labs

Deploying SPIRE in HA Mode

Deploy a 3-replica SPIRE Server with PostgreSQL.

Deploy PostgreSQL for SPIRE datastore
Deploy 3 SPIRE Server replicas
Verify that every replica serves requests and that failover works
Simulate a server failure and observe recovery

Configuring SPIFFE Federation

Federate two SPIRE deployments for cross-cluster trust.

Deploy two separate SPIRE instances (two Kind clusters)
Exchange trust bundles between the instances
Register federated workload entries
Verify cross-cluster mTLS communication

Recap

Key takeaways

Production SPIRE requires HA with a shared database (PostgreSQL/MySQL)
Nested SPIRE enables hierarchical trust for multi-team organizations
Federation allows cross-trust-domain authentication via bundle exchange
Multi-cloud works because attestation is plugin-based, not cloud-specific
Plan trust domain boundaries early - they are hard to change later
