Module 10 of 13

Day Two Operations and Observability

Monitor, troubleshoot, and maintain SPIRE in production

3 hours, 2 labs, free

Start here

Learning objectives

Monitor SPIRE with Prometheus metrics
Debug common attestation and rotation failures
Plan certificate rotation and upgrade strategies
Implement operational runbooks for SPIRE

Deploying SPIRE is step one. Keeping it healthy in production is the ongoing challenge. This module covers the operational practices that keep SPIRE running reliably: monitoring, alerting, debugging, and maintenance.

Monitoring SPIRE

Both the SPIRE Server and the SPIRE Agent can export Prometheus metrics once a Prometheus sink is enabled in the telemetry section of their configuration. Key metrics to monitor:

# Server metrics:
spire_server_ca_manager_x509_ca_rotate_total    # CA rotation count
spire_server_registration_api_entry_count       # Active registrations
spire_server_node_attestation_duration_seconds  # Attestation latency

# Agent metrics:
spire_agent_svid_rotations_total              # SVID rotation count
spire_agent_workload_api_connections          # Active workload connections
spire_agent_attestation_errors_total          # Failed attestations

# Alert on:
# - spire_agent_attestation_errors_total increasing (misconfigured workloads or selectors)
# - spire_agent_svid_rotations_total no longer increasing (rotation failure)
# - attested node count decreasing (agents disconnecting)
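
The three alert conditions above reduce to simple comparisons over successive samples of a metric. The following is a minimal Python sketch of that logic, not how Prometheus evaluates rules; the function names and sample values are hypothetical, and in a real deployment these checks would normally be written as Prometheus alerting rules.

```python
from typing import Sequence


def counter_increased(samples: Sequence[float]) -> bool:
    """True if a counter grew between the first and last sample (e.g. attestation errors)."""
    return len(samples) >= 2 and samples[-1] > samples[0]


def counter_stalled(samples: Sequence[float]) -> bool:
    """True if a counter did not grow over the window (e.g. SVID rotations stopped)."""
    return len(samples) >= 2 and samples[-1] <= samples[0]


def gauge_decreased(samples: Sequence[float]) -> bool:
    """True if a gauge ended lower than it started (e.g. fewer attested nodes)."""
    return len(samples) >= 2 and samples[-1] < samples[0]


# Hypothetical samples scraped over one alerting window.
attestation_errors = [4, 4, 7]
svid_rotations = [120, 120, 120]
node_count = [50, 50, 47]

print(counter_increased(attestation_errors))  # True
print(counter_stalled(svid_rotations))        # True
print(gauge_decreased(node_count))            # True
```

The window must be long enough that a healthy agent would have rotated at least once within it. Note that this sketch treats a counter reset (for example after a process restart) as a stall.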

Troubleshooting Common Issues

Node Attestation Fails

# Symptom: Agent fails to attest and cannot connect to the Server
# Check: Agent logs for attestation error details
# Common causes:
# - Expired or already-used join token (tokens are single-use; regenerate with spire-server token generate)
# - No network connectivity from the agent node to SPIRE Server port 8081
# - Clock skew between agent node and server (certificate validity is time-based)
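
The network connectivity cause can be ruled in or out quickly from the agent node. Below is a minimal Python sketch that only tests whether a TCP connection to the Server port can be opened; the function name is hypothetical, and a successful connection says nothing about TLS or attestation itself.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run it from the agent node with the Server address and port 8081, for example can_reach("<server-address>", 8081). A False result points at DNS, firewall, or routing problems rather than at the join token or the clock.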

Workload Gets No SVID

# Symptom: Application cannot fetch identity from Workload API
# Debug steps:
# 1. Check registration entries match the workload selectors
spire-server entry show -selector k8s:ns:my-namespace

# 2. Check that the Workload API returns an SVID
#    (run this as the workload, or from a process that matches the same selectors)
spire-agent api fetch x509 -socketPath /run/spire/sockets/agent.sock

# 3. Check the workload API socket is mounted correctly
ls -la /run/spire/sockets/agent.sock
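
Step 3 can also be checked from inside the application. The following is a minimal Python sketch, assuming the socket path used above; the function name is hypothetical. It distinguishes a real Unix domain socket from a path that is missing or that exists only as a directory or regular file.

```python
import os
import stat


def is_unix_socket(path: str) -> bool:
    """Return True if path exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Path is missing or not accessible to this process.
        return False


print(is_unix_socket("/run/spire/sockets/agent.sock"))
```

If this returns False while the agent is running, the volume mount is the first thing to inspect.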

Certificate Rotation Strategies

The default X.509-SVID TTL is 1 hour. Shorter TTLs limit how long a stolen credential remains usable, but they increase signing load on the SPIRE Server. Recommended values for production: 1 hour for standard workloads and 15 minutes for high-security workloads. Configure monitoring to alert if an SVID has not rotated by the time 80% of its TTL has elapsed (48 minutes for a 1-hour TTL).
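
The 80% rule translates into a concrete alert threshold per TTL. Here is a short Python sketch of the arithmetic; the function name and the fraction parameter are illustrative, not part of SPIRE.

```python
from datetime import timedelta


def rotation_alert_after(ttl: timedelta, fraction: float = 0.8) -> timedelta:
    """Return how long an SVID may go without rotating before an alert should fire."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return ttl * fraction


print(rotation_alert_after(timedelta(hours=1)))     # 0:48:00
print(rotation_alert_after(timedelta(minutes=15)))  # 0:12:00
```

For the two TTLs recommended above, the alert should fire after 48 minutes and 12 minutes without a rotation, respectively.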

Upgrade Strategies

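SPIRE supports rolling upgrades. Upgrade the Server first, then the Agents: SPIRE's version compatibility policy expects an Agent to be no newer than the Server it talks to, and allows it to lag by up to one minor version. Always test upgrades in a non-production environment first, and back up the datastore before every Server upgrade.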
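
The datastore backup can be scripted so it is never skipped. The following is a minimal Python sketch that assumes the Server uses a SQLite datastore; the function name and paths are hypothetical. For MySQL or PostgreSQL datastores, use the database's own backup tooling instead.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def backup_sqlite_datastore(db_path: str, backup_dir: str) -> Path:
    """Copy a SQLite database to a timestamped file using SQLite's online backup API."""
    src_path = Path(db_path)
    if not src_path.is_file():
        raise FileNotFoundError(db_path)

    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest_path = Path(backup_dir) / f"{src_path.stem}-{stamp}{src_path.suffix}"
    dest_path.parent.mkdir(parents=True, exist_ok=True)

    src = sqlite3.connect(str(src_path))
    dest = sqlite3.connect(str(dest_path))
    try:
        # Takes a consistent snapshot even if the source database is in use.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
    return dest_path
```

Calling backup_sqlite_datastore with the datastore file path and a backup directory returns the path of the new copy. Restore-test the copy in a non-production environment before relying on it.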

Real world

Where this shows up

24/7 monitoring of SVID rotation across 100+ services
Incident response for certificate expiry events
Capacity planning for SPIRE Server based on registration entry growth
Automated alerting for attestation failures indicating configuration drift

Production notes

Keep these close

Set alerts for three conditions: spire_agent_svid_rotations_total stalling, spire_agent_attestation_errors_total increasing, and the number of active agents decreasing.
Always back up the SPIRE datastore before upgrades. If a migration corrupts it and no backup exists, every registration entry has to be recreated.
SPIRE supports rolling upgrades: upgrade the Server first, then the Agents, so that no Agent runs a newer version than its Server.

Common mistakes

What usually breaks

Not monitoring SVID rotation - stalled rotation means imminent certificate expiry
Running upgrades without backup - corrupted datastore means losing all registrations
Ignoring clock skew between nodes - SVIDs have time-based validity
No runbooks - team scrambles during incidents instead of following documented procedures

Labs

Hands-on labs

Monitoring SPIRE Metrics

Set up Prometheus and Grafana dashboards for SPIRE.

Configure SPIRE to expose Prometheus metrics
Deploy Prometheus with SPIRE scrape targets
Import the SPIRE Grafana dashboard
Create alerts for attestation failures and rotation stalls

Debugging Registration Failures

Troubleshoot common SPIRE issues using production debugging techniques.

Introduce intentional misconfigurations
Use spire-server and spire-agent CLI for debugging
Read agent and server logs to identify root causes
Fix the issues and verify recovery

Recap

Key takeaways

Monitor svid_rotations, attestation_errors, and active_agents continuously
Alert on rotation stalls - a stopped rotation means expired certificates soon
Debug workload identity issues by checking: registration entry, agent logs, socket mount
Upgrade the server first, then the agents - always back up the datastore beforehand
Document operational runbooks for common failure scenarios

Related resources