Deploying SPIRE is step one. Keeping it healthy in production is the ongoing challenge. This module covers the operational practices that keep SPIRE running reliably: monitoring, alerting, debugging, and maintenance.
Monitoring SPIRE
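SPIRE can export Prometheus metrics from both the Server and the Agent once a Prometheus sink is enabled in the telemetry section of each configuration file. Key metrics to monitor (exact names vary with SPIRE version and sink, so confirm them against your own metrics endpoint):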
# Server metrics:
spire_server_ca_manager_x509_ca_rotate_total # CA rotation count
spire_server_registration_api_entry_count # Active registrations
spire_server_node_attestation_duration_seconds # Attestation latency
# Agent metrics:
spire_agent_svid_rotations_total # SVID rotation count
spire_agent_workload_api_connections # Active workload connections
spire_agent_attestation_errors_total # Failed attestations
# Alert on:
# - attestation_errors increasing (broken workload config)
# - svid_rotations_total stalling (rotation failure)
# - node count decreasing (agents disconnecting)
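The "stalling" alert above comes down to comparing two readings of a counter. The following is a minimal sketch of that check, not SPIRE tooling: metric_sum and counter_status are hypothetical helper names, and the metrics URL and port in the usage comment are assumptions that depend on your telemetry configuration.

```shell
#!/bin/sh
# Sum all samples of one metric from Prometheus text-format input on stdin.
# Assumes samples carry no trailing timestamp, so the value is the last field.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP and TYPE lines
    {
      metric = $1
      sub(/[{].*/, "", metric)           # drop the label set, if any
      if (metric == name) total += $NF
    }
    END { printf "%d\n", total }
  '
}

# Compare two readings of a counter taken some interval apart.
counter_status() {
  if [ "$2" -gt "$1" ]; then echo "advancing"
  elif [ "$2" -lt "$1" ]; then echo "reset"      # counter went backwards, e.g. after a restart
  else echo "stalled"
  fi
}

# Hypothetical usage; the URL and port are assumptions:
#   before=$(curl -s http://localhost:9988/metrics | metric_sum spire_agent_svid_rotations_total)
#   sleep 1800
#   after=$(curl -s http://localhost:9988/metrics | metric_sum spire_agent_svid_rotations_total)
#   counter_status "$before" "$after"
```

In production this comparison usually lives in an alerting rule rather than a script, but the script form is handy for a quick manual check on a single node.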
Troubleshooting Common Issues
Node Attestation Fails
# Symptom: Agent cannot connect to Server
# Check: Agent logs for attestation error details
# Common causes:
# - Expired or already-used join token (join tokens are single-use; regenerate with spire-server token generate)
# - Network connectivity to SPIRE Server port 8081
# - Clock skew between agent node and server (SVIDs are time-sensitive)
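The connectivity and clock-skew causes above can be checked by hand. This is a rough sketch under stated assumptions: clock_skew is a hypothetical helper, SPIRE_SERVER_HOST is a placeholder for your own server address, and the commented commands assume nc and ssh are available on the agent node.

```shell
#!/bin/sh
# Absolute difference between two Unix timestamps, in seconds.
clock_skew() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  echo "$diff"
}

# Hypothetical usage from the agent node (host name is an assumption):
#   nc -z -w 5 "$SPIRE_SERVER_HOST" 8081 && echo "server port reachable"
#   clock_skew "$(date +%s)" "$(ssh "$SPIRE_SERVER_HOST" date +%s)"
```

A skew of more than a few seconds is worth fixing with NTP before looking further, since certificate validity windows are checked against the local clock.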
Workload Gets No SVID
# Symptom: Application cannot fetch identity from Workload API
# Debug steps:
# 1. Check registration entries match the workload selectors
spire-server entry show -selector k8s:ns:my-namespace
# 2. From inside the workload's own container or process context, check that the agent issues it an SVID
spire-agent api fetch x509 -socketPath /run/spire/sockets/agent.sock
# 3. Check the workload API socket is mounted correctly
ls -la /run/spire/sockets/agent.sock
Certificate Rotation Strategies
SVID TTL defaults to 1 hour. Shorter TTLs limit how long a stolen credential stays valid, but they increase signing load on the SPIRE Server. Production recommendations: 1 hour for standard workloads and 15 minutes for high-security workloads. Configure monitoring to alert if an SVID has not rotated by the time 80% of its TTL has elapsed.
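To make the 80% rule concrete, here is the arithmetic as a small sketch. The function names are hypothetical, and the threshold percentage is passed in so that it can be tuned.

```shell
#!/bin/sh
# Seconds after issuance at which to alert, given a TTL in seconds and a percentage.
rotation_alert_after() {
  echo $(( $1 * $2 / 100 ))
}

# Decide whether an SVID is overdue for rotation at the 80% mark.
# Arguments: issued-at epoch, current epoch, TTL in seconds.
rotation_overdue() {
  limit=$(rotation_alert_after "$3" 80)
  age=$(( $2 - $1 ))
  if [ "$age" -gt "$limit" ]; then echo "overdue"; else echo "ok"; fi
}

rotation_alert_after 3600 80   # 1 hour TTL: prints 2880 (48 minutes)
rotation_alert_after 900 80    # 15 minute TTL: prints 720 (12 minutes)
```

So a 1-hour SVID that is still unrotated after 48 minutes, or a 15-minute SVID still unrotated after 12 minutes, should raise an alert.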
Upgrade Strategies
SPIRE supports rolling upgrades. Upgrade the Server first, then the Agents: SPIRE's version-compatibility guidance is that an Agent should never run a newer version than the Server it connects to. Always test upgrades in a non-production environment first, and back up the datastore before upgrading the Server.
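The datastore backup step might look like the sketch below for a file-based (SQLite) datastore. backup_datastore is a hypothetical helper and the paths in the usage comment are assumptions; for a MySQL or PostgreSQL datastore, use that database's own dump tooling instead.

```shell
#!/bin/sh
# Copy a datastore file to a timestamped backup and print the backup path.
# Only safe for a file-based datastore while the Server is stopped, because
# copying a database file that is being written can produce a corrupt copy.
backup_datastore() {
  src="$1"
  dest_dir="$2"
  if [ ! -f "$src" ]; then
    echo "no datastore at $src" >&2
    return 1
  fi
  mkdir -p "$dest_dir" || return 1
  dest="$dest_dir/$(basename "$src").$(date +%Y%m%d%H%M%S).bak"
  cp -p "$src" "$dest" && echo "$dest"
}

# Hypothetical usage (both paths are assumptions):
#   backup_datastore /var/lib/spire/server/datastore.sqlite3 /var/backups/spire
```

Printing the backup path makes it easy to record in an upgrade log and to restore from if the upgrade has to be rolled back.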