Module 9: Observability & Debugging Slides
Slide walkthrough for Module 9 of Distributed Systems Engineering: Building Scalable, Reliable & Secure Systems: Distributed tracing, metrics, structured...
This slide page is the visual review companion for the full course module. Use it to recap the architecture, examples, exercises, production warnings, and takeaways after reading the lesson.
Slide Outline
- Observability & Debugging - Distributed tracing, metrics, structured logging, correlation IDs, and the OpenTelemetry / Prometheus / Grafana / Jaeger stack that lets you debug systems you cannot SSH into.
- Learning Objectives - 5 outcomes for this module
- Why This Module Matters - Observability is the difference between debuggable and undebuggable systems. A distributed system you cannot trace is a distributed system you cannot operate at scale.
- Before vs After - The operational shift this module teaches
- The Three Pillars (and the Fourth) - Lesson section from the full module
- OpenTelemetry - The Standard - Lesson section from the full module
- Distributed Tracing - Lesson section from the full module
- Sampling - Lesson section from the full module
- The Four Golden Signals - Lesson section from the full module
- Structured Logging and Correlation IDs - Lesson section from the full module
- SLO and Error Budget - Lesson section from the full module
- Debugging Distributed Systems - Lesson section from the full module
- Real-World Use Cases - Netflix runs distributed tracing across thousands of services with sampled tail-based collection. Google's Dapper paper (2010) is the foundation of modern distributed tracing.
- Common Mistakes to Avoid - 4 mistakes covered
- Production Notes - 4 practical notes
- Security Risks to Watch - 3 risks covered
- Hands-On Labs - 3 hands-on labs
- Key Takeaways - 5 points to remember
Learning Objectives
- Instrument a service with OpenTelemetry traces, metrics, and logs
- Correlate a single request across many services via trace IDs
- Build the four golden signals (latency, traffic, errors, saturation) in Prometheus
- Read a distributed trace and identify where latency accrues
- Build the runbook a 3am on-call engineer actually uses
Why This Module Matters
Observability is the difference between debuggable and undebuggable systems. A distributed system you cannot trace is a distributed system you cannot operate at scale. Engineers who build observability in from the start have meaningfully shorter MTTR; engineers who bolt it on after the first incident spend years catching up. SLOs and error budgets convert observability into engineering discipline that aligns product velocity with reliability.
Production Notes
- Use OpenTelemetry; one SDK, many backends. Vendor-specific SDKs are a future migration cost.
- Prefer tail-based sampling over head-based sampling for error visibility. Alternatively, use head-based sampling at a modest rate and force-sample on errors.
- Define SLOs that map to user experience, not system health. p99 latency on the checkout flow matters; p99 on the health-check endpoint does not.
- Tag every log with trace_id and user_id. Without correlation, logs at scale are unsearchable.
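The correlation note above can be sketched with the Python standard library alone. This is a minimal illustration, not the course's reference implementation: the logger name `checkout`, the `JsonFormatter` class, and the sample IDs are assumptions made for the example. In a real service the trace ID would come from the active OpenTelemetry span rather than a literal.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so log backends can index fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Write to an in-memory buffer so the example is easy to inspect.
buffer = io.StringIO()
log = build_logger(buffer)
log.info(
    "payment authorised",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "user_id": "u-123"},
)
entry = json.loads(buffer.getvalue())
```

Because every line is a JSON object carrying `trace_id`, a log search for one trace ID returns the lines from every service that handled the request.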
Common Mistakes
- Tracing only the easy services first. Services without tracing stay invisible, and they are usually the legacy ones causing the incidents.
- Alerting on anything that wiggles. Alert fatigue is a category of incident on its own.
- Tracking averages instead of percentiles. Means hide tail behaviour; p99 shows what the slowest requests actually experience.
- Burning through the error budget without slowing down. The whole point is to slow shipping when the budget is exhausted.
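The point about averages and percentiles can be made concrete with a small calculation. The latency values below are invented for illustration: 98 fast requests and 2 very slow ones.

```python
import statistics

# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [20.0] * 98 + [2000.0] * 2

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
# quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
p99_ms = statistics.quantiles(latencies_ms, n=100, method="inclusive")[98]

print(f"mean={mean_ms:.1f} ms  median={median_ms:.1f} ms  p99={p99_ms:.1f} ms")
```

The mean stays well under 100 ms and looks healthy, while the p99 lands on the 2-second outliers that some users actually hit.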
Key Takeaways
- Observability has three pillars (metrics, logs, traces) plus a fourth (profiles); use them together
- OpenTelemetry is the standard; one SDK, many backends
- Distributed tracing answers latency questions that metrics cannot; instrument every service
- The four golden signals (latency, traffic, errors, saturation) are the minimum metric set per service
- SLOs and error budgets convert observability into engineering discipline
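As a rough illustration of the last takeaway, an error budget is the complement of the SLO target applied to a window. The function names and the example numbers below are assumptions chosen for the sketch, not values from the module.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 30 days tolerates roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999, 30), 1))
# 250 failures out of 1,000,000 requests spends about a quarter of the budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

When the remaining fraction reaches zero, the policy described in this module is to slow shipping until reliability recovers.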
Hands-On Labs
Lab 9.1 — Trace a Request End-to-End
Instrument a 3-service chain with OTel; trace a request across all three; visualise in Jaeger.
60 minutes - Intermediate
- Add OTel SDK to each service
- Propagate W3C Trace Context across HTTP/gRPC calls
- Send a request; find the trace in Jaeger
- Identify the slowest span
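For the propagation step, the OTel SDK's propagators normally read and write the header for you. The sketch below only illustrates what travels on the wire: a W3C `traceparent` header of the form `00-<trace-id>-<parent-id>-<flags>`. The helper names `parse_traceparent` and `child_traceparent` are hypothetical, and only version `00` is handled.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context, version "00": 16-byte trace-id, 8-byte parent-id, 1-byte flags,
# all lowercase hex, separated by dashes.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the incoming context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID, same sampled flag."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
```

The trace ID stays constant across all three services, which is what lets Jaeger stitch the spans into a single trace.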
Lab 9.2 — Build the Four Golden Signals
Add Prometheus instrumentation for latency, traffic, errors, saturation; build a Grafana dashboard.
90 minutes - Intermediate
- Instrument requests with histogram + counter
- Track connection pool saturation
- Build a Grafana dashboard with all four signals
- Define an SLO and visualise the burn rate
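For the final step of this lab, burn rate is commonly defined as the observed error ratio divided by the error ratio the SLO allows. The sketch below shows the arithmetic in plain Python under that assumption; in the lab itself the same ratio would be expressed as a Prometheus query over the request counters. The function names and sample numbers are illustrative.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0
    allowed_error_ratio = 1.0 - slo_target
    return (failed / total) / allowed_error_ratio


def days_until_exhausted(rate: float, window_days: int) -> float:
    """Days until a full budget is spent if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# 50 failures in 10,000 requests is a 0.5% error ratio; against a 99.9% SLO
# that is about five times the allowed rate.
rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
days_left = days_until_exhausted(rate, window_days=30)
```

A burn rate of 1 spends the budget exactly at the end of the window; at a rate near 5, a 30-day budget lasts about 6 days.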
Lab 9.3 — Incident Triage from Telemetry
Inject a partial failure; use only the dashboards and traces to identify root cause.
60 minutes - Intermediate
- Inject a 30% error rate at one downstream service
- Use only Grafana + Jaeger to identify which service
- Identify which user-facing endpoint is most affected
- Document the runbook
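One way to approach the injection step is a thin wrapper around the downstream call that fails a fixed fraction of the time. This is a sketch under assumptions: `downstream`, `InjectedFault`, and `with_fault_injection` are hypothetical names, and a real lab might inject faults at a proxy or service mesh instead of in application code.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Raised in place of the real call to simulate a downstream failure."""


def with_fault_injection(
    call: Callable[[], str], error_rate: float, rng: random.Random
) -> Callable[[], str]:
    """Wrap `call` so that roughly `error_rate` of invocations raise InjectedFault."""

    def wrapped() -> str:
        if rng.random() < error_rate:
            raise InjectedFault("injected downstream failure")
        return call()

    return wrapped


def downstream() -> str:
    # Hypothetical stand-in for the real downstream service call.
    return "ok"


# A seeded generator keeps the experiment reproducible between runs.
flaky = with_fault_injection(downstream, error_rate=0.30, rng=random.Random(42))

failures = 0
for _ in range(10_000):
    try:
        flaky()
    except InjectedFault:
        failures += 1
observed_rate = failures / 10_000
```

With the fault in place, the error-rate panel for the affected service should rise to roughly 30%, and traces through it should show failed spans at the injected call.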