Module 10 of 13

Day Two Operations and Observability

Monitor, troubleshoot, and maintain SPIRE in production

3 hours · 2 labs · Free

Start here

Learning objectives

  • Monitor SPIRE with Prometheus metrics
  • Debug common attestation and rotation failures
  • Plan certificate rotation and upgrade strategies
  • Implement operational runbooks for SPIRE

Figure: SPIRE observability stack. Components: SPIRE Server + Agent, Prometheus, Grafana, AlertManager, Loki (logs). Key metrics: svid_rotation_count, attestation_errors, bundle_update_latency, active_agents.

Deploying SPIRE is step one. Keeping it healthy in production is the ongoing challenge. This module covers the operational practices that keep SPIRE running reliably: monitoring, alerting, debugging, and maintenance.

Monitoring SPIRE

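SPIRE exposes Prometheus metrics for both the Server and the Agent once telemetry is enabled in their configuration. Exact metric names depend on the SPIRE version and telemetry settings, so confirm them against your own metrics endpoint. Key metrics to monitor: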

# Server metrics:
spire_server_ca_manager_x509_ca_rotate_total    # CA rotation count
spire_server_registration_api_entry_count       # Active registrations
spire_server_node_attestation_duration_seconds  # Attestation latency

# Agent metrics:
spire_agent_svid_rotations_total              # SVID rotation count
spire_agent_workload_api_connections          # Active workload connections
spire_agent_attestation_errors_total          # Failed attestations

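# Alert on:
# - spire_agent_attestation_errors_total increasing (often a broken workload or selector config)
# - spire_agent_svid_rotations_total flat for longer than the expected rotation interval (rotation failure)
# - number of active agents decreasing (agents disconnecting)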
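
One way to spot a stalled counter from the command line is to sum a metric from a Prometheus text endpoint and compare two readings taken some time apart. The sketch below is a minimal illustration, not SPIRE tooling: `metric_sum` and `counter_stalled` are hypothetical helper names, the host and port in the usage comment are placeholders for whatever your telemetry configuration sets, and the parser assumes sample lines carry no trailing timestamp.

```shell
# Sum all samples of one metric from Prometheus text-format input on stdin.
# Assumes each sample line ends with its value (no trailing timestamp).
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)         # drop the {label="..."} part
      if (metric == name) total += $NF
    }
    END { printf "%d\n", total }
  '
}

# Succeed (exit 0) when a counter did not advance between two readings.
counter_stalled() {
  [ "$2" -le "$1" ]
}

# Example use (placeholder endpoint; not executed here):
#   before=$(curl -s http://localhost:9988/metrics | metric_sum spire_agent_svid_rotations_total)
#   sleep 1800
#   after=$(curl -s http://localhost:9988/metrics | metric_sum spire_agent_svid_rotations_total)
#   counter_stalled "$before" "$after" && echo "no SVID rotations in the last 30 minutes"
```

In production the same comparison belongs in a Prometheus alert rule; the script is only useful for ad hoc checks during an incident.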

Troubleshooting Common Issues

Node Attestation Fails

# Symptom: Agent fails to attest and cannot establish a connection to the Server
# Check: Agent logs for the attestation error details
# Common causes:
# - Expired or already-used join token (join tokens are single-use; regenerate with spire-server token generate)
# - No network path from the agent node to the SPIRE Server port (8081 by default)
# - Clock skew between agent node and server (certificate validity windows are time-based)
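
The causes above can be checked roughly in that order from the agent node. This is a sketch under assumptions: `spire-server.example.org` and the SPIFFE ID are placeholders, the 30-second tolerance is an arbitrary choice rather than a SPIRE limit, `skew_within` is a hypothetical helper, and the commented commands need `nc` and `ssh` to be available.

```shell
# Succeed when two Unix timestamps differ by at most max_skew seconds.
skew_within() {
  local a=$1 b=$2 max_skew=$3
  local diff=$(( a - b ))
  if [ "$diff" -lt 0 ]; then diff=$(( -diff )); fi
  [ "$diff" -le "$max_skew" ]
}

# Example use (placeholder host; not executed here):
#   nc -z -w 3 spire-server.example.org 8081 || echo "server port unreachable"
#   remote=$(ssh spire-server.example.org date -u +%s)
#   skew_within "$(date -u +%s)" "$remote" 30 || echo "clock skew too large"
#   # On the server, issue a fresh join token:
#   spire-server token generate -spiffeID spiffe://example.org/my-agent
```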

Workload Gets No SVID

# Symptom: Application cannot fetch identity from Workload API
# Debug steps:
# 1. Check registration entries match the workload selectors
spire-server entry show -selector k8s:ns:my-namespace

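# 2. Check the agent issues an SVID to the workload (run this from inside the workload's container or process, because the agent attests the caller)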
spire-agent api fetch x509 -socketPath /run/spire/sockets/agent.sock

# 3. Check the workload API socket is mounted correctly
ls -la /run/spire/sockets/agent.sock
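
Step 3 can be made a little more informative than `ls` by distinguishing a missing path from a path that exists but is not a Unix socket. A minimal sketch, assuming the same socket path as above; `check_workload_socket` is a hypothetical helper name, and the hints in its messages are common causes rather than a complete diagnosis.

```shell
# Report whether the Workload API socket path exists and is a Unix socket.
# Exit codes: 0 = ok, 1 = path missing, 2 = path exists but is not a socket.
check_workload_socket() {
  local sock=${1:-/run/spire/sockets/agent.sock}
  if [ ! -e "$sock" ]; then
    echo "missing: $sock (volume not mounted, or agent not running)"
    return 1
  fi
  if [ ! -S "$sock" ]; then
    echo "not a socket: $sock (wrong path mounted?)"
    return 2
  fi
  echo "ok: $sock"
}
```

Run it inside the workload's container, since that is where a bad volume mount shows up.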

Certificate Rotation Strategies

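The X.509-SVID TTL defaults to 1 hour. Shorter TTLs limit how long a stolen credential stays usable, but they increase signing load on the SPIRE Server. Production recommendations: 1 hour for standard workloads and 15 minutes for high-security workloads. Configure monitoring to alert when an SVID has not rotated by the time 80% of its TTL has elapsed; agents normally renew well before expiry, so reaching that mark signals a real problem.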
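
The 80% rule is simple arithmetic, shown here for the two recommended TTLs. This is an illustrative sketch: `rotation_alert_after` and `rotation_overdue` are hypothetical helper names, and all times are in seconds.

```shell
# Seconds after issuance at which to alert: 80% of the TTL (integer math).
rotation_alert_after() {
  echo $(( $1 * 80 / 100 ))
}

# Succeed when the SVID's age has reached the 80% mark without a rotation.
rotation_overdue() {
  local issued_at=$1 now=$2 ttl=$3
  [ $(( now - issued_at )) -ge "$(rotation_alert_after "$ttl")" ]
}

rotation_alert_after 3600   # 1-hour TTL    -> 2880 (48 minutes)
rotation_alert_after 900    # 15-minute TTL -> 720  (12 minutes)
```

So a workload on a 15-minute TTL leaves only a 3-minute window between the alert firing and the certificate expiring, which is worth factoring into on-call response times.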

Upgrade Strategies

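SPIRE supports rolling upgrades. Upgrade the Server first, then the Agents: SPIRE's version compatibility guidance requires that an Agent never run a newer version than the Server it connects to. Move one minor version at a time, test every upgrade in a non-production environment first, and back up the datastore before each Server upgrade.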
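
For a Server that uses the SQLite datastore, a pre-upgrade backup can be as small as a timestamped, verified file copy. This sketch assumes the Server is stopped (or otherwise not writing) while the copy runs; `backup_datastore` is a hypothetical helper and both paths in the usage comment are placeholders. For PostgreSQL or MySQL datastores, use the database's own dump tooling instead.

```shell
# Copy a datastore file to dest_dir with a UTC timestamp suffix, verify the
# copy byte-for-byte, and print the backup path on success.
backup_datastore() {
  local src=$1 dest_dir=$2
  local dest
  dest="$dest_dir/$(basename "$src").$(date -u +%Y%m%dT%H%M%SZ).bak"
  mkdir -p "$dest_dir" || return 1
  cp -p "$src" "$dest" || return 1
  cmp -s "$src" "$dest" || return 1
  echo "$dest"
}

# Example use (placeholder paths; not executed here):
#   backup_datastore /opt/spire/data/server/datastore.sqlite3 /var/backups/spire
```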

Real world

Where this shows up

  • 24/7 monitoring of SVID rotation across 100+ services
  • Incident response for certificate expiry events
  • Capacity planning for SPIRE Server based on registration entry growth
  • Automated alerting for attestation failures indicating configuration drift

Production notes

Keep these close

  • Set alerts for: svid_rotations_total stalling, attestation_errors increasing, active_agent_count decreasing.
  • Always back up the SPIRE datastore before upgrades. If a migration corrupts it and no backup exists, every registration entry has to be recreated.
  • SPIRE supports rolling upgrades: upgrade the Server first, then the Agents, because an Agent must not be newer than the Server it connects to.

Common mistakes

What usually breaks

  • Not monitoring SVID rotation — stalled rotation means imminent certificate expiry
  • Running upgrades without backup — corrupted datastore means losing all registrations
  • Ignoring clock skew between nodes — SVIDs have time-based validity
  • No runbooks — team scrambles during incidents instead of following documented procedures

Labs

Hands-on labs

Monitoring SPIRE Metrics

Set up Prometheus and Grafana dashboards for SPIRE.

  1. Configure SPIRE to expose Prometheus metrics
  2. Deploy Prometheus with SPIRE scrape targets
  3. Import the SPIRE Grafana dashboard
  4. Create alerts for attestation failures and rotation stalls

Debugging Registration Failures

Troubleshoot common SPIRE issues using production debugging techniques.

  1. Introduce intentional misconfigurations
  2. Use spire-server and spire-agent CLI for debugging
  3. Read agent and server logs to identify root causes
  4. Fix the issues and verify recovery

Recap

Key takeaways

  • Monitor svid_rotations, attestation_errors, and active_agents continuously
  • Alert on rotation stalls — a stopped rotation means expired certificates soon
  • Debug workload identity issues by checking: registration entry, agent logs, socket mount
  • Upgrade the Server first, then the Agents, and always back up the datastore beforehand
  • Document operational runbooks for common failure scenarios
