Module 10: Day Two Operations and Observability Slides
Slide walkthrough for Module 10 of Mastering SPIFFE & SPIRE: Zero Trust for Cloud Native Systems: Monitor, troubleshoot, and maintain SPIRE in production...
This slide page is the visual review companion for the full course module. Use it to recap the architecture, examples, exercises, production warnings, and takeaways after reading the lesson.
Slide Outline
- Day Two Operations and Observability - Monitor, troubleshoot, and maintain SPIRE in production
- Learning Objectives - 4 outcomes for this module
- Why This Module Matters - Deploying SPIRE is day one. Keeping it running reliably is every day after. Production incidents from certificate expiry, attestation failures, and datastore issues are inevitable.
- Monitoring SPIRE - Lesson section from the full module
- Troubleshooting Common Issues - Lesson section from the full module
- Certificate Rotation Strategies - Lesson section from the full module
- Upgrade Strategies - Lesson section from the full module
- Real-World Use Cases - 24/7 monitoring of SVID rotation across 100+ services, Incident response for certificate expiry events
- Common Mistakes to Avoid - 4 mistakes covered
- Production Notes - 3 practical notes
- Hands-On Labs - 2 hands-on labs
- Key Takeaways - 5 points to remember
Learning Objectives
- Monitor SPIRE with Prometheus metrics
- Debug common attestation and rotation failures
- Plan certificate rotation and upgrade strategies
- Implement operational runbooks for SPIRE
Why This Module Matters
Deploying SPIRE is day one. Keeping it running reliably is every day after. Production incidents from certificate expiry, attestation failures, and datastore issues are inevitable. This module gives you the monitoring, debugging, and operational playbooks to handle them confidently.
Production Notes
- Set alerts for: svid_rotations_total stalling, attestation_errors increasing, active_agent_count decreasing.
- Always back up the SPIRE datastore before upgrades. If a migration corrupts the datastore and there is no backup, every registration entry has to be recreated.
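- SPIRE supports rolling upgrades. Upstream upgrade guidance is to upgrade the Server first and the Agents afterwards, because an Agent should not run a newer version than the Server it connects to.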
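The three alert conditions in the first note can be sketched in a few lines. This is a minimal illustration, not SPIRE's or Prometheus's alerting API: the `Sample` class and `evaluate_alerts` function are hypothetical, and the metric names are copied from the note above, so check them against what your SPIRE version actually emits. In practice these conditions would normally be written as Prometheus alerting rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One scrape of the three signals named in the production notes."""
    svid_rotations_total: float  # counter: should keep increasing
    attestation_errors: float    # counter: should stay flat
    active_agent_count: float    # gauge: should not drop


def evaluate_alerts(window: List[Sample]) -> List[str]:
    """Return the alerts that fire for an oldest-to-newest window of samples.

    Simplification: counter resets (for example after a restart) are not
    handled and would look like a stall.
    """
    if len(window) < 2:
        return []  # not enough data to compare
    first, last = window[0], window[-1]
    alerts = []
    if last.svid_rotations_total <= first.svid_rotations_total:
        alerts.append("svid_rotation_stalled")
    if last.attestation_errors > first.attestation_errors:
        alerts.append("attestation_errors_increasing")
    if last.active_agent_count < first.active_agent_count:
        alerts.append("active_agents_decreasing")
    return alerts


healthy = [Sample(10, 0, 5), Sample(12, 0, 5)]
degraded = [Sample(10, 0, 5), Sample(10, 3, 4)]
print(evaluate_alerts(healthy))
print(evaluate_alerts(degraded))
```

The window length is the main tuning knob: it should be longer than the expected interval between rotations, otherwise a healthy but quiet period is reported as a stall.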
Common Mistakes
- Not monitoring SVID rotation: a stalled rotation means certificates are about to expire
- Running upgrades without a backup: a corrupted datastore means losing all registrations
- Ignoring clock skew between nodes: SVIDs have time-based validity, so a node with a skewed clock can reject a certificate that is actually valid
- No runbooks: the team scrambles during incidents instead of following documented procedures
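The clock skew mistake is easy to reproduce on paper. The sketch below assumes a hypothetical SVID with a one-hour validity window and compares how an in-sync node and a node whose clock runs two minutes behind evaluate it. The function name, timestamps, and lifetime are illustrative and not taken from SPIRE.

```python
from datetime import datetime, timedelta, timezone


def is_valid_on_node(not_before, not_after, true_now, clock_offset):
    """Return True if a node whose clock is off by clock_offset accepts the window."""
    local_now = true_now + clock_offset  # what the skewed node believes the time is
    return not_before <= local_now <= not_after


# Hypothetical SVID issued at noon UTC with a one-hour lifetime.
issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=1)
just_after_issue = issued + timedelta(seconds=10)

in_sync = is_valid_on_node(issued, expires, just_after_issue, timedelta(0))
behind = is_valid_on_node(issued, expires, just_after_issue, timedelta(minutes=-2))
print(in_sync, behind)  # prints: True False
```

The node that is two minutes behind sees a certificate that is "not yet valid", which shows up as a handshake failure even though issuance and rotation are working.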
Key Takeaways
- Monitor svid_rotations_total, attestation_errors, and active_agent_count continuously
- Alert on rotation stalls — a stopped rotation means expired certificates soon
- Debug workload identity issues by checking: registration entry, agent logs, socket mount
- Upgrade the Server first, then the Agents, and always back up the datastore beforehand
- Document operational runbooks for common failure scenarios
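Of the three debugging checks listed above, the socket mount is the easiest to automate. The sketch below is one possible runbook helper, not part of SPIRE. The function name is hypothetical, and the path is an assumption that should be replaced with the `socket_path` from your agent configuration or the mount path in your pod spec. The other two checks are usually manual: `spire-server entry show` to inspect registration entries, and the agent's own logs.

```python
import os
import stat

# Assumed path; use the socket_path from your agent configuration or pod mount.
DEFAULT_SOCKET = "/tmp/spire-agent/public/api.sock"


def check_workload_socket(path=DEFAULT_SOCKET):
    """Classify the Workload API socket path as 'ok', 'missing', or 'not_a_socket'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"  # the volume is probably not mounted into the workload
    return "ok" if stat.S_ISSOCK(mode) else "not_a_socket"


print(check_workload_socket("/nonexistent-spire-dir/api.sock"))
```

A result of `missing` from inside the workload container usually points at the volume mount rather than at SPIRE itself, which narrows the search before anyone opens the agent logs.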
Hands-On Labs
Monitoring SPIRE Metrics
Set up Prometheus and Grafana dashboards for SPIRE.
- Configure SPIRE to expose Prometheus metrics
- Deploy Prometheus with SPIRE scrape targets
- Import the SPIRE Grafana dashboard
- Create alerts for attestation failures and rotation stalls
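Before building dashboards, it helps to confirm that the scrape target really exposes the metrics you plan to alert on. The sketch below parses a small subset of the Prometheus text exposition format and reports which expected names are absent. The helper names and the sample payload are hypothetical, and the metric names are the ones used in this module rather than a verified list of what SPIRE emits.

```python
def parse_metrics(text):
    """Parse a minimal subset of the Prometheus text exposition format.

    Simplifications: optional trailing timestamps are not supported, and
    samples that differ only by labels are summed under the metric name.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = values.get(name, 0.0) + float(value)
    return values


def missing_metrics(text, expected):
    """Return the expected metric names that do not appear in the scrape."""
    present = parse_metrics(text)
    return sorted(name for name in expected if name not in present)


# Hypothetical scrape body, shaped like a /metrics response.
sample = """# TYPE svid_rotations_total counter
svid_rotations_total{agent="a"} 3
svid_rotations_total{agent="b"} 4
active_agent_count 2
"""
expected = ["svid_rotations_total", "attestation_errors", "active_agent_count"]
print(parse_metrics(sample))
print(missing_metrics(sample, expected))
```

A missing name is worth resolving before writing alerts, since a rule on a metric that never appears will silently never fire.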
Debugging Registration Failures
Troubleshoot common SPIRE issues using production debugging techniques.
- Introduce intentional misconfigurations
- Use spire-server and spire-agent CLI for debugging
- Read agent and server logs to identify root causes
- Fix the issues and verify recovery
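A frequent root cause in this lab is a registration entry whose selectors do not match what the agent discovers about the workload. A workload qualifies for an entry only when it presents every selector listed on that entry. The sketch below shows that subset check. The function names and the selector values are made up for illustration, using the `k8s:` selector style.

```python
def entry_matches(entry_selectors, workload_selectors):
    """Return True if the workload presents every selector the entry requires."""
    return set(entry_selectors) <= set(workload_selectors)


def missing_selectors(entry_selectors, workload_selectors):
    """List the entry selectors that the workload does not present."""
    return sorted(set(entry_selectors) - set(workload_selectors))


# Hypothetical entry and workload: the service account was mistyped in the entry.
entry = ["k8s:ns:payments", "k8s:sa:api"]
workload = ["k8s:ns:payments", "k8s:sa:default", "k8s:pod-name:api-7d9f"]

print(entry_matches(entry, workload))
print(missing_selectors(entry, workload))
```

Comparing the two lists by hand during an incident is error-prone, so printing the difference makes the intentional misconfiguration from the first lab step quick to spot.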