Module 10: Day Two Operations and Observability
Monitor, troubleshoot, and maintain SPIRE in production
3 hours. 2 hands-on labs. Free course module.
Learning Objectives
- Monitor SPIRE with Prometheus metrics
- Debug common attestation and rotation failures
- Plan certificate rotation and upgrade strategies
- Implement operational runbooks for SPIRE
Why This Matters
Deploying SPIRE is day one. Keeping it running reliably is every day after. Production incidents from certificate expiry, attestation failures, and datastore issues are inevitable. This module gives you the monitoring, debugging, and operational playbooks to handle them confidently.
Lesson Content
This lesson covers the operational practices that keep SPIRE healthy in production: monitoring, alerting, debugging, and maintenance.
Monitoring SPIRE
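SPIRE exposes Prometheus metrics for both the Server and the Agent once a Prometheus sink is enabled in the telemetry section of each configuration file. Exact metric names vary with the SPIRE version and telemetry settings, so treat the names below as representative and confirm them against your own metrics endpoint. Key metrics to monitor: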
# Server metrics:
spire_server_ca_manager_x509_ca_rotate_total # CA rotation count
spire_server_registration_api_entry_count # Active registrations
spire_server_node_attestation_duration_seconds # Attestation latency
# Agent metrics:
spire_agent_svid_rotations_total # SVID rotation count
spire_agent_workload_api_connections # Active workload connections
spire_agent_attestation_errors_total # Failed attestations
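# Alert on:
# - spire_agent_attestation_errors_total increasing (often a broken workload or selector configuration)
# - spire_agent_svid_rotations_total flat for longer than the expected rotation interval (rotation failure)
# - number of attested agents decreasing (agents disconnecting)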
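The three alert conditions above can be sketched as a comparison between two scrapes taken some time apart. This is a minimal illustration, not SPIRE or Prometheus code: the `Scrape` type, its field names, and the alert labels are hypothetical, and in practice the same logic would normally live in Prometheus alerting rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scrape:
    """Hypothetical snapshot of three values read from the metrics endpoint."""
    svid_rotations_total: float      # counter: should keep growing
    attestation_errors_total: float  # counter: should stay flat
    agent_count: float               # gauge: should not shrink unexpectedly


def evaluate_alerts(older: Scrape, newer: Scrape) -> List[str]:
    """Return the names of alert conditions that fire between two scrapes.

    Assumes the scrapes are further apart than the expected rotation
    interval; otherwise a flat rotation counter is normal.
    """
    alerts: List[str] = []
    if newer.attestation_errors_total > older.attestation_errors_total:
        alerts.append("attestation_errors_increasing")
    # A counter that did not grow (or was reset by a restart) counts as stalled.
    if newer.svid_rotations_total <= older.svid_rotations_total:
        alerts.append("svid_rotation_stalled")
    if newer.agent_count < older.agent_count:
        alerts.append("agent_count_decreasing")
    return alerts


if __name__ == "__main__":
    before = Scrape(svid_rotations_total=10, attestation_errors_total=0, agent_count=5)
    after = Scrape(svid_rotations_total=10, attestation_errors_total=2, agent_count=4)
    print(evaluate_alerts(before, after))
```

The comparison window matters most for the rotation check: it should be longer than the interval at which rotations are expected, or healthy agents will appear stalled.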
Troubleshooting Common Issues
Node Attestation Fails
# Symptom: Agent cannot connect to Server
# Check: Agent logs for attestation error details
# Common causes:
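# - Expired or already used join token (join tokens are single-use; regenerate with spire-server token generate)
# - Network connectivity to SPIRE Server port 8081
# - Clock skew between agent node and server (SVID validity is time-based, so skewed clocks make valid certificates look expired or not yet valid)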
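Two of the causes above, network reachability and clock skew, are easy to check from the agent node. The sketch below uses only the Python standard library and is not part of SPIRE. The function names and the 60-second tolerance in the usage example are illustrative assumptions, and `spire-server.example.org` is a placeholder hostname.

```python
import socket
from datetime import datetime, timedelta, timezone


def can_reach(host: str, port: int = 8081, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def clock_skew_exceeds(local: datetime, remote: datetime, tolerance: timedelta) -> bool:
    """Return True if the two timestamps differ by more than the tolerance."""
    return abs(local - remote) > tolerance


if __name__ == "__main__":
    # Placeholder hostname; replace with your SPIRE Server address.
    print("server reachable:", can_reach("spire-server.example.org", 8081))
    now = datetime.now(timezone.utc)
    # In practice the remote time would come from the server node or an NTP source.
    print("skew too large:", clock_skew_exceeds(now, now, timedelta(seconds=60)))
```

A successful TCP connection only shows that the port is open. It does not prove that the TLS handshake or attestation will succeed, so the agent logs remain the primary source of detail.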
Workload Gets No SVID
# Symptom: Application cannot fetch identity from Workload API
# Debug steps:
# 1. Check registration entries match the workload selectors
spire-server entry show -selector k8s:ns:my-namespace
# 2. Confirm the agent issues an SVID to the caller (run this from inside the workload's own pod or process context)
spire-agent api fetch x509 -socketPath /run/spire/sockets/agent.sock
# 3. Check the workload API socket is mounted correctly
ls -la /run/spire/sockets/agent.sock
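Step 1 usually comes down to one rule: a registration entry applies to a workload only when every selector on the entry is among the selectors the agent discovered for that workload. The sketch below illustrates that subset check. The helper names and the selector values are hypothetical examples, not output from a real cluster.

```python
from typing import Iterable, Set


def entry_matches(entry_selectors: Iterable[str], workload_selectors: Iterable[str]) -> bool:
    """True when every selector on the entry was observed on the workload."""
    return set(entry_selectors) <= set(workload_selectors)


def missing_selectors(entry_selectors: Iterable[str], workload_selectors: Iterable[str]) -> Set[str]:
    """Selectors the entry requires that the workload does not present."""
    return set(entry_selectors) - set(workload_selectors)


if __name__ == "__main__":
    # Hypothetical entry: requires both a namespace and a service account.
    entry = ["k8s:ns:my-namespace", "k8s:sa:my-service-account"]
    # Hypothetical selectors discovered for the running pod.
    workload = ["k8s:ns:my-namespace", "k8s:sa:default"]

    print(entry_matches(entry, workload))
    print(missing_selectors(entry, workload))
```

Listing the missing selectors tends to point straight at the misconfiguration, for example a pod running under a different service account than the one the entry names.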
Certificate Rotation Strategies
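The default X.509-SVID TTL is 1 hour. Shorter TTLs narrow the window in which a stolen SVID is usable, but they increase signing load on the SPIRE Server because every workload renews more often. Production recommendations: 1 hour for standard workloads and 15 minutes for high-security workloads. Whatever TTL you choose, alert when an SVID has not been rotated by the time 80% of its TTL has elapsed (48 minutes for a 1-hour TTL, 12 minutes for a 15-minute TTL), which leaves the remaining 20% to respond before expiry.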
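The 80% rule is simple arithmetic on the issue time and the TTL. The sketch below shows the calculation. The function names and the `fraction` parameter are illustrative, and the issue time is assumed to come from the SVID's validity start.

```python
from datetime import datetime, timedelta, timezone


def rotation_deadline(issued_at: datetime, ttl: timedelta, fraction: float = 0.8) -> datetime:
    """Latest time by which a rotation should have been observed."""
    return issued_at + ttl * fraction


def rotation_overdue(issued_at: datetime, ttl: timedelta, now: datetime, fraction: float = 0.8) -> bool:
    """True when the SVID is still unrotated at or past the alert threshold."""
    return now >= rotation_deadline(issued_at, ttl, fraction)


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    # Standard workload: 1-hour TTL. High-security workload: 15-minute TTL.
    print(rotation_deadline(issued, timedelta(hours=1)))
    print(rotation_deadline(issued, timedelta(minutes=15)))
    print(rotation_overdue(issued, timedelta(hours=1), issued + timedelta(minutes=50)))
```

Keeping the threshold as a fraction rather than a fixed duration means the same alert definition works for both TTL tiers.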
Upgrade Strategies
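SPIRE supports rolling upgrades. Upgrade the Server first, then the Agents: a Server is expected to keep working with Agents that are one minor version behind, while an Agent newer than its Server is not a supported combination. Move one minor version at a time, and confirm the compatibility rules in the upgrade guide for your release. Agents hold little state (their own SVID and a cached trust bundle), so they are quick to roll once the Server is healthy. Always test upgrades in a non-production environment first, and back up the datastore before every Server upgrade.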
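As one way to take the pre-upgrade backup, the sketch below copies a SQLite datastore with SQLite's online backup API and then runs an integrity check on the copy. It assumes the Server uses the SQLite datastore. Both file paths are hypothetical, and deployments on MySQL or PostgreSQL would use that database's own backup tooling instead.

```python
import sqlite3
from pathlib import Path


def backup_sqlite(source: Path, destination: Path) -> None:
    """Copy a SQLite database file using SQLite's online backup API."""
    if not source.exists():
        # sqlite3.connect would silently create an empty database otherwise.
        raise FileNotFoundError(source)
    src = sqlite3.connect(str(source))
    dst = sqlite3.connect(str(destination))
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()


def integrity_ok(db_path: Path) -> bool:
    """Run PRAGMA integrity_check and report whether SQLite returns 'ok'."""
    conn = sqlite3.connect(str(db_path))
    try:
        return conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical paths; adjust to your Server's data directory.
    source = Path("/var/lib/spire/server/datastore.sqlite3")
    target = Path("/var/backups/spire/datastore-pre-upgrade.sqlite3")
    backup_sqlite(source, target)
    print("backup verified:", integrity_ok(target))
```

Verifying the copy before starting the upgrade matters as much as taking it, since an unreadable backup is only discovered when it is needed.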
Real-World Use Cases
- 24/7 monitoring of SVID rotation across 100+ services
- Incident response for certificate expiry events
- Capacity planning for SPIRE Server based on registration entry growth
- Automated alerting for attestation failures indicating configuration drift
Production Notes
- Set alerts for: svid_rotations_total stalling, attestation_errors increasing, active_agent_count decreasing.
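- Always back up the SPIRE datastore before upgrades. If a migration corrupts the datastore and no backup exists, every registration entry has to be recreated.
- SPIRE supports rolling upgrades: upgrade the Server first, then the Agents, one minor version at a time. Agents newer than their Server are not a supported combination.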
Common Mistakes
- Not monitoring SVID rotation: stalled rotation means imminent certificate expiry
- Running upgrades without a backup: a corrupted datastore means losing all registrations
- Ignoring clock skew between nodes: SVIDs have time-based validity
- No runbooks: the team scrambles during incidents instead of following documented procedures
Production Story
A production SPIRE deployment went 3 months without issues. Then one Monday morning, services started failing with TLS handshake errors. Investigation revealed that SVID rotation had silently stalled on 12 nodes after a kernel update changed the clock synchronization. The team had no alerting for rotation metrics. After adding Prometheus alerts for stalled rotations, they caught the next occurrence in 5 minutes instead of 3 hours.
Career Relevance
SRE and platform engineering roles increasingly require operating identity infrastructure. Engineers who can monitor, troubleshoot, and maintain SPIRE in production are scarce and highly valued.
Hands-On Labs
Monitoring SPIRE Metrics
Set up Prometheus and Grafana dashboards for SPIRE.
- Configure SPIRE to expose Prometheus metrics
- Deploy Prometheus with SPIRE scrape targets
- Import the SPIRE Grafana dashboard
- Create alerts for attestation failures and rotation stalls
Debugging Registration Failures
Troubleshoot common SPIRE issues using production debugging techniques.
- Introduce intentional misconfigurations
- Use spire-server and spire-agent CLI for debugging
- Read agent and server logs to identify root causes
- Fix the issues and verify recovery
Key Takeaways
- Monitor svid_rotations, attestation_errors, and active_agents continuously
- Alert on rotation stalls — a stopped rotation means expired certificates soon
- Debug workload identity issues by checking: registration entry, agent logs, socket mount
- Upgrade the Server first, then the Agents, one minor version at a time, and always back up the datastore beforehand
- Document operational runbooks for common failure scenarios