
Module 10: Day Two Operations and Observability Slides

Slide walkthrough for Module 10 of Mastering SPIFFE & SPIRE: Zero Trust for Cloud Native Systems: Monitor, troubleshoot, and maintain SPIRE in production...

This slide page is the visual review companion for the full course module. Use it to recap the architecture, examples, exercises, production warnings, and takeaways after reading the lesson.

Slide Outline

  1. Day Two Operations and Observability - Monitor, troubleshoot, and maintain SPIRE in production
  2. Learning Objectives - 4 outcomes for this module
  3. Why This Module Matters - Deploying SPIRE is day one. Keeping it running reliably is every day after.
  4. Monitoring SPIRE - Lesson section from the full module
  5. Troubleshooting Common Issues - Lesson section from the full module
  6. Certificate Rotation Strategies - Lesson section from the full module
  7. Upgrade Strategies - Lesson section from the full module
  8. Real-World Use Cases - 24/7 monitoring of SVID rotation across 100+ services; incident response for certificate expiry events
  9. Common Mistakes to Avoid - 4 mistakes covered
  10. Production Notes - 3 practical notes
  11. Hands-On Labs - 2 hands-on labs
  12. Key Takeaways - 5 points to remember

Learning Objectives

  • Monitor SPIRE with Prometheus metrics
  • Debug common attestation and rotation failures
  • Plan certificate rotation and upgrade strategies
  • Implement operational runbooks for SPIRE

Why This Module Matters

Deploying SPIRE is day one. Keeping it running reliably is every day after. Production incidents from certificate expiry, attestation failures, and datastore issues are inevitable. This module gives you the monitoring, debugging, and operational playbooks to handle them confidently.

Production Notes

  • Set alerts for: svid_rotations_total stalling, attestation_errors increasing, active_agent_count decreasing.
  • Always back up the SPIRE datastore before upgrades. If a migration corrupts the datastore and no backup exists, every registration entry has to be recreated.
  • SPIRE supports rolling upgrades. The upgrade order documented by the SPIRE project is Server first, then Agents, moving one minor version at a time.
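
The three alert conditions above can be sketched as simple checks over a window of recent samples. This is a minimal illustration, not a replacement for Prometheus alerting rules: the metric names are the ones used on this slide, and the names actually emitted depend on the SPIRE version and telemetry configuration. The helper names (`stalled`, `rising`, `falling`, `evaluate_alerts`) are hypothetical, and counter resets after a restart are not handled.

```python
from typing import List, Mapping, Sequence


def stalled(samples: Sequence[float]) -> bool:
    """A counter that ends the window where it started has stopped advancing."""
    return len(samples) >= 2 and samples[-1] == samples[0]


def rising(samples: Sequence[float]) -> bool:
    """An error counter that grew over the window."""
    return len(samples) >= 2 and samples[-1] > samples[0]


def falling(samples: Sequence[float]) -> bool:
    """A gauge that is lower at the end of the window than at the start."""
    return len(samples) >= 2 and samples[-1] < samples[0]


def evaluate_alerts(window: Mapping[str, Sequence[float]]) -> List[str]:
    """Return the alert conditions that fire for one window of samples.

    Keys are the metric names from the slide (assumed, not verified
    against a running SPIRE deployment).
    """
    alerts: List[str] = []
    if stalled(window.get("svid_rotations_total", [])):
        alerts.append("svid_rotations_total stalled")
    if rising(window.get("attestation_errors", [])):
        alerts.append("attestation_errors increasing")
    if falling(window.get("active_agent_count", [])):
        alerts.append("active_agent_count decreasing")
    return alerts
```

The window length is the main tuning decision: it should be longer than the normal interval between rotations, otherwise a healthy but quiet period looks like a stall.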

Common Mistakes

  • Not monitoring SVID rotation: a stalled rotation means certificates are about to expire
  • Running upgrades without a backup: a corrupted datastore means losing all registrations
  • Ignoring clock skew between nodes: SVID validity is time-based, so a skewed clock can reject a valid SVID or accept an expired one
  • No runbooks: the team scrambles during incidents instead of following documented procedures
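
To see why clock skew matters, the sketch below models a verifier whose clock is offset from true time and checks an SVID's validity window against that skewed reading. The function names and the example times are illustrative assumptions; real validation is done by the TLS or JWT library, not by hand.

```python
from datetime import datetime, timedelta, timezone


def observed_time(true_now: datetime, clock_offset: timedelta) -> datetime:
    """Time as read by a node whose clock is off by clock_offset."""
    return true_now + clock_offset


def is_valid(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """Standard validity-window check: not_before <= now <= not_after."""
    return not_before <= now <= not_after


# Hypothetical SVID issued at 12:00 UTC with a one-hour lifetime.
issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
not_before, not_after = issued, issued + timedelta(hours=1)
true_now = issued + timedelta(seconds=30)

# A node with an accurate clock accepts the SVID.
accurate = is_valid(not_before, not_after, observed_time(true_now, timedelta(0)))
# A node running two minutes behind sees it as not yet valid.
behind = is_valid(not_before, not_after, observed_time(true_now, timedelta(minutes=-2)))
# A node running two hours ahead sees it as already expired.
ahead = is_valid(not_before, not_after, observed_time(true_now, timedelta(hours=2)))
```

Short SVID lifetimes make this worse: the shorter the validity window, the smaller the skew needed to push a node outside it.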

Key Takeaways

  • Monitor svid_rotations_total, attestation_errors, and active_agent_count continuously
  • Alert on rotation stalls — a stopped rotation means expired certificates soon
  • Debug workload identity issues by checking: registration entry, agent logs, socket mount
  • Back up the datastore before every upgrade, then upgrade the Server before the Agents, one minor version at a time
  • Document operational runbooks for common failure scenarios
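
The "socket mount" step of the debugging checklist can be automated with a small check run from inside the workload's container or host. This is a sketch under assumptions: the path shown in the usage note is SPIRE Agent's default `socket_path`, and many deployments mount the socket elsewhere, so substitute the path from your agent configuration. The function name is hypothetical.

```python
import os
import stat


def check_workload_socket(path: str) -> str:
    """Classify the state of the Workload API socket at `path`.

    Returns one of: "ok", "missing", "permission denied", "not a socket".
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        # The directory is not mounted or the agent is not running.
        return "missing"
    except PermissionError:
        # The workload user cannot traverse the socket directory.
        return "permission denied"
    return "ok" if stat.S_ISSOCK(mode) else "not a socket"
```

For example, `check_workload_socket("/tmp/spire-agent/public/api.sock")` returning `"missing"` points at the volume mount or the agent process rather than at the registration entry.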

Hands-On Labs

  1. Monitoring SPIRE Metrics

    Set up Prometheus and Grafana dashboards for SPIRE.

    • Configure SPIRE to expose Prometheus metrics
    • Deploy Prometheus with SPIRE scrape targets
    • Import the SPIRE Grafana dashboard
    • Create alerts for attestation failures and rotation stalls
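
Before building dashboards, it can help to confirm what the metrics endpoint actually returns. SPIRE exposes Prometheus metrics when a `Prometheus` sink is set in its `telemetry` configuration block; the sketch below parses a scraped payload in the Prometheus text exposition format and sums a metric across its label sets. The parser is deliberately simplified (it ignores timestamps and assumes no `}` appears outside the label block), the function names are hypothetical, and the metric names in the sample are the ones used on this slide rather than verified SPIRE output.

```python
from typing import Dict


def parse_samples(text: str) -> Dict[str, float]:
    """Map each series (name plus labels) to its sample value."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, HELP and TYPE comments
        if "}" in line:
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:]
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            samples[series] = float(fields[0])  # fields[1], if any, is a timestamp
        except ValueError:
            continue  # ignore malformed lines
    return samples


def total(samples: Dict[str, float], name: str) -> float:
    """Sum a metric over all of its label combinations."""
    return sum(
        value
        for series, value in samples.items()
        if series == name or series.startswith(name + "{")
    )


SAMPLE = """\
# HELP attestation_errors Illustrative counter
# TYPE attestation_errors counter
attestation_errors{node="a"} 3
attestation_errors{node="b"} 2
svid_rotations_total 41 1700000000000
"""
```

Feeding successive scrapes through `total` gives the per-metric series that alert rules for attestation failures and rotation stalls are built on.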


  2. Debugging Registration Failures

    Troubleshoot common SPIRE issues using production debugging techniques.

    • Introduce intentional misconfigurations
    • Use spire-server and spire-agent CLI for debugging
    • Read agent and server logs to identify root causes
    • Fix the issues and verify recovery
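
A first-pass triage for this lab can be scripted by collecting a fixed set of CLI checks. The sketch below builds the command lines and runs them, recording output instead of stopping at the first failure. The subcommands and flags shown (`entry show -spiffeID`, `agent list`, `healthcheck -socketPath`, `api fetch x509 -socketPath`) exist in current SPIRE releases as far as I know, but confirm them with `-h` for your version; the function names, the SPIFFE ID, and the socket path in the usage note are placeholders.

```python
import subprocess
from typing import List, Optional, Tuple


def diagnostic_commands(spiffe_id: str, agent_socket: str) -> List[List[str]]:
    """Commands to run when a workload cannot obtain its SVID."""
    return [
        # Does a registration entry exist for this identity?
        ["spire-server", "entry", "show", "-spiffeID", spiffe_id],
        # Is the node's agent attested and known to the server?
        ["spire-server", "agent", "list"],
        # Is the agent healthy on the expected socket?
        ["spire-agent", "healthcheck", "-socketPath", agent_socket],
        # Can an SVID be fetched over the Workload API?
        ["spire-agent", "api", "fetch", "x509", "-socketPath", agent_socket],
    ]


def run_diagnostics(commands: List[List[str]]) -> List[Tuple[str, Optional[int], str]]:
    """Run each command and return (binary, exit code, combined output)."""
    results: List[Tuple[str, Optional[int], str]] = []
    for argv in commands:
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=30)
            results.append((argv[0], proc.returncode, proc.stdout + proc.stderr))
        except FileNotFoundError:
            results.append((argv[0], None, "binary not found on PATH"))
        except subprocess.TimeoutExpired:
            results.append((argv[0], None, "timed out"))
    return results
```

A typical call would be `run_diagnostics(diagnostic_commands("spiffe://example.org/my-workload", "/tmp/spire-agent/public/api.sock"))`, with the server commands run where the Server's API socket is reachable.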

