Module 9 of 12

Observability & Debugging

Distributed tracing, metrics, structured logging, correlation IDs, and the OpenTelemetry / Prometheus / Grafana / Jaeger stack that lets you debug systems you cannot SSH into.

4 hours · 3 labs · Free

Start here

Learning objectives

Instrument a service with OpenTelemetry traces, metrics, and logs
Correlate a single request across many services via trace IDs
Build the four golden signals (latency, traffic, errors, saturation) in Prometheus
Read a distributed trace and identify where latency accrues
Build the runbook a 3am on-call engineer actually uses

Before

Print-statement debugging across distributed services; root-cause takes hours
Per-service dashboards in different tools; correlation is manual
Alert fatigue from threshold-based alerts that fire on noise
No SLOs; everyone has different definitions of “working”

After

Distributed traces correlate the full request across services
Unified Grafana on top of Prometheus + Tempo + Loki; single pane of glass
SLO + error-budget burn-rate alerts; signal over noise
Documented SLOs aligned to user experience; product decisions tied to error budget

Observability is what separates a system you can debug from one you cannot. In a monolith, debugging means reading a stack trace. In a distributed system, debugging means reconstructing a request across many services from telemetry alone. If your observability is poor, root-causing an incident takes hours of guesswork. If it is good, the system tells you where it broke.

The Three Pillars (and the Fourth)

Metrics: numeric time-series. Cheap to store, cheap to query, great for alerting and dashboards. Prometheus is the open-source standard.
Logs: discrete events with timestamps and structure. Expensive to store at scale; great for debugging known issues.
Traces: per-request causality across services. Heavy to capture and store; essential for debugging latency.
Profiles (the modern fourth): CPU and allocation samples per service over time (Pyroscope, Parca). Catches what metrics aggregate away.

OpenTelemetry - The Standard

OpenTelemetry (OTel) is the CNCF project that unifies instrumentation. One SDK in your application emits all three signals; an OTel Collector batches, samples, and ships them to whichever backends you choose. This removes most of the vendor lock-in of older instrumentation libraries: switching backends becomes a Collector configuration change rather than a re-instrumentation project.

Production pattern: every service uses the OTel SDK; an OTel Collector runs as a DaemonSet on Kubernetes; the Collector forwards traces to Jaeger/Tempo, metrics to Prometheus, logs to Loki or Elasticsearch. Grafana is the unified UI on top.

Distributed Tracing

A trace is a tree of spans, where each span represents a unit of work (an HTTP call, a database query, a message handler). The root span is the user request; child spans are everything it triggered. Each span carries the trace ID shared by the whole request, its own span ID, and the span ID of its parent. The trace ID and the calling span's ID cross service boundaries in the W3C Trace Context traceparent header.
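To make the propagation step concrete, here is a minimal sketch of building and parsing a traceparent header by hand, using only the Python standard library. The helper names (new_traceparent, parse_traceparent, child_traceparent) are hypothetical; in a real service the OTel SDK's propagators do this for you. The layout follows the W3C Trace Context format: version, 32-hex-digit trace ID, 16-hex-digit parent span ID, 2-hex-digit flags.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags (lowercase hex).
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace: fresh 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Split an incoming header into its fields; reject malformed values."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace or span ID")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Header to send downstream: same trace ID, this hop's new span ID."""
    ctx = parse_traceparent(incoming)
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

The point to notice is that the trace ID never changes along the request path, while the span ID field is replaced at every hop. That is what lets a backend such as Jaeger reassemble the tree.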

Tracing answers questions metrics cannot: why is p99 of request type X high? A trace shows you exactly which downstream service contributed the latency. Why does this rare error happen? The trace shows the full causal chain. Where is this request actually going? The trace reveals architectural surprises (services calling services you forgot existed).

Sampling

Tracing every request is expensive at scale. Sampling reduces volume:

Head-based sampling: decide at the start of the trace whether to keep it (e.g. 1% of all requests). Simple, but the decision is made before the outcome is known, so most error and slow traces are dropped along with the rest.
Tail-based sampling: collect everything, decide after the trace completes (e.g. keep all error traces and a sample of success traces). Better visibility; harder to operate.
Adaptive / hybrid: head-sample at modest rate; force-sample known interesting paths (errors, slow requests, specific endpoints).
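The difference between the strategies above comes down to what information is available when the decision is made. The sketch below is illustrative only: the function names, the span dictionary shape (trace_id, duration_ms, error), and the thresholds are assumptions, not the OTel Collector's actual sampler configuration.

```python
def head_sample(trace_id, rate):
    """Decide at trace start using only the trace ID.

    Hashing the ID into [0, 1) makes the decision deterministic, so every
    service that sees the same trace ID keeps or drops it consistently.
    """
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < rate


def tail_sample(spans, slow_ms, baseline_rate):
    """Decide after the trace completes, when its outcome is known.

    Assumes `spans` is a non-empty list of dicts for one finished trace.
    """
    if any(span["error"] for span in spans):
        return True  # always keep error traces
    if max(span["duration_ms"] for span in spans) >= slow_ms:
        return True  # always keep slow traces
    # Otherwise keep a small baseline of healthy traces for comparison.
    return head_sample(spans[0]["trace_id"], baseline_rate)
```

The operational cost of the tail-based version is hidden in its argument list: something has to buffer every span of every trace until the trace is complete before tail_sample can run.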

The Four Golden Signals

From the Google SRE book, the four metrics every service should emit:

Latency: time to serve a request. Track p50, p95, p99; alert on p99.
Traffic: requests per second. Sudden change is a signal even if everything else looks fine.
Errors: failed requests. Express as a rate (errors per second) or a ratio (failed requests divided by total requests).
Saturation: how full the system is. CPU, memory, queue depth, connection pool utilisation. Saturation precedes other failures.

Every dashboard, every alert, every SLO connects back to these four. If you only emit four metrics per service, emit these.
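As a rough illustration of what the four signals measure, the sketch below computes them from a window of raw request records. In Prometheus you would derive the same numbers from a latency histogram and request counters rather than from raw records; the record shape (latency_ms, status), the choice of 5xx as "failed", and connection-pool usage as the saturation measure are all assumptions made for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s, pool_in_use, pool_size):
    """Summarise one observation window into the four golden signals."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = [r["latency_ms"] for r in requests]
    # Assumption: only 5xx responses count as failures.
    failed = sum(1 for r in requests if r["status"] >= 500)
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
        "traffic_rps": len(requests) / window_s,
        "error_ratio": failed / len(requests),
        "saturation": pool_in_use / pool_size,
    }
```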

Structured Logging and Correlation IDs

Logs are useful when they are queryable. That means structured (JSON or key=value) and correlated. Every log line should include the trace ID so you can filter by request and the user/tenant ID so you can debug per-user issues.

The minimum log line for a service: timestamp, level, service, trace_id, span_id, user_id, message, ...fields. Anything less and your logs are unsearchable at scale.
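One way to get that minimum log line is a JSON formatter on top of Python's standard logging module. This is a sketch under assumptions: the class and function names are hypothetical, and the IDs are passed explicitly through the logger's extra argument, whereas in an instrumented service they would normally be read from the active span context.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the correlation fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # Values supplied via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "user_id": getattr(record, "user_id", None),
            "message": record.getMessage(),
        }
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service, stream):
    """Return a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout", sys.stdout)
    log.info(
        "payment declined",
        extra={
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "span_id": "00f067aa0ba902b7",
            "user_id": "u-42",
            "fields": {"amount_cents": 1299},
        },
    )
```

Because every line is a flat JSON object, a log backend can index trace_id and user_id as fields, which is what makes "show me every log line for this request" a cheap query.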

SLO and Error Budget

Service Level Objectives (SLOs) translate the four golden signals into commitments. “p99 latency < 300ms over 30 days” or “99.9% of requests return 2xx/3xx over 30 days”. The error budget is the difference between 100% and the SLO - the amount of failure you have permission to spend on risky changes, deploys, or experiments.

Operating with explicit SLOs and error budgets is the discipline of the modern SRE function. The pattern: when error budget is healthy, you ship features fast; when it is exhausted, you stop shipping and stabilise.
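The error-budget arithmetic is simple enough to sketch directly. The function names below are hypothetical and the request counts are made-up illustration values; the only formula used is the one stated above, budget = 100% minus SLO.

```python
def error_budget(slo, window_days=30):
    """Translate an SLO (e.g. 0.999) into a failure allowance."""
    budget_ratio = 1.0 - slo
    window_minutes = window_days * 24 * 60
    return {
        "budget_ratio": budget_ratio,
        # Equivalent full-outage time if the whole budget went on downtime.
        "allowed_bad_minutes": budget_ratio * window_minutes,
    }


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the request-based budget still unspent (negative = blown)."""
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

For a 99.9% SLO over 30 days, error_budget gives roughly 43.2 minutes of full outage. With 1,000,000 requests in the window the allowance is about 1,000 failures, so 250 failed requests leaves about 75% of the budget.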

Debugging Distributed Systems

The flow that works in practice:

Alert fires ⇒ identify the affected service from dashboard.
Check the four golden signals for that service.
Pick a representative failed trace; walk it span by span; identify where latency or error appears.
If it is a downstream service, recurse: open that service's dashboard, repeat.
If it is in the service itself, jump to logs filtered by that trace ID.
If logs do not show the cause, jump to profiles.

That's the loop. Every minute saved in this loop is a minute off MTTR.
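The "walk it span by span" step usually means finding the span with the most self time, that is, time not explained by its children. A minimal sketch, assuming each span is a dict with span_id, parent_id, and duration_ms, and that a span's children run sequentially (with parallel children the subtraction over-counts and needs interval merging instead):

```python
def self_times(spans):
    """Self time per span: own duration minus its direct children's."""
    child_total = {}
    for span in spans:
        parent = span["parent_id"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + span["duration_ms"]
    return {
        span["span_id"]: span["duration_ms"] - child_total.get(span["span_id"], 0)
        for span in spans
    }


def slowest_span(spans):
    """ID of the span where the most latency actually accrues."""
    times = self_times(spans)
    return max(times, key=times.get)


# Hypothetical trace: gateway -> checkout -> (db query, payment call).
example_trace = [
    {"span_id": "gateway", "parent_id": None, "duration_ms": 480},
    {"span_id": "checkout", "parent_id": "gateway", "duration_ms": 450},
    {"span_id": "db-query", "parent_id": "checkout", "duration_ms": 40},
    {"span_id": "payment", "parent_id": "checkout", "duration_ms": 390},
]
```

In this example the gateway span is the longest at 480 ms, but almost all of that is inherited from its children; the payment span is where the time is really spent, so that is the service to recurse into.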

[Figure: Distributed Request Trace Timeline]

Self-Check Quiz

Why are metrics, logs, and traces complementary rather than redundant? (Answer: metrics tell you something is wrong; traces tell you which service; logs tell you why. Each answers a different question at a different cost.)
You have head-based sampling at 1%. Errors get under-represented. What is the fix? (Answer: tail-based sampling - sample after the trace completes, force-include error traces. Or hybrid head+force-on-error.)
Your dashboard shows error rate at 0.1% - within SLO. Customer support says many users complain. What is happening? (Answer: an aggregate hides concentration. Your 0.1% may fall almost entirely on one tenant or one feature. Slice metrics by user/tenant/feature, not just at service level.)
What are the four golden signals and why does saturation matter? (Answer: latency, traffic, errors, saturation. Saturation is the leading indicator: the queue fills before latency climbs, and latency climbs before errors fire.)

For runtime-detection observability the Runtime Security cheatsheet covers Falco/Tetragon eBPF telemetry alongside the application-layer signals.

Real world

Where this shows up

Netflix runs distributed tracing across thousands of services with sampled tail-based collection.
Google's Dapper paper (2010) is the foundation of modern distributed tracing.
Cloudflare uses Honeycomb (event-driven observability) for high-cardinality investigation.
Uber built Jaeger to handle its tracing volume, then donated it to the CNCF.

Production notes

Keep these close

Use OpenTelemetry; one SDK, many backends. Vendor-specific SDKs are a future migration cost.
Sample tail-based, not head-based, for error visibility. Or use head-based at modest rate plus force-sample on errors.
Define SLOs that map to user experience, not system health. p99 latency on the checkout flow matters; p99 on the health-check endpoint does not.
Tag every log with trace_id and user_id. Without correlation, logs at scale are unsearchable.

Common mistakes

What usually breaks

Tracing only the easy services. The services without tracing become invisible - usually the legacy ones causing the incidents.
Alerting on anything that wiggles. Alert fatigue is a category of incident on its own.
Tracking averages instead of percentiles. Means hide tail behaviour; p99 is the truth.
Burning through error budget without slowing down. The whole point is to slow shipping when the budget is exhausted.

Security risks

Threats to watch

Logs containing tokens, passwords, or PII become a parallel data-exfiltration target. Apply structured-log scrubbing at the collector.
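OpenTelemetry Collectors can expose internal trace data through diagnostic endpoints (for example the zpages and pprof extensions) and through receivers bound to all network interfaces. Enable only what you need and lock the rest down.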
Trace context (W3C Trace Context header) propagates across trust boundaries; sanitise before forwarding to third parties.
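The scrubbing mentioned in the first threat above would normally run in the Collector pipeline, but the logic is easier to see in a few lines of Python. This is a sketch only: the key list, the bearer-token pattern, and the function name are assumptions, and a production rule set needs to cover the secret and PII shapes your own services actually emit.

```python
import re

# Hypothetical deny-list of field names whose values are never logged.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key", "secret"}

# Matches "Bearer <credential>" embedded in free-text values.
BEARER_RE = re.compile(r"(?i)bearer\s+[a-z0-9._~+/=-]+")


def scrub(record):
    """Return a copy of a structured log record with secrets redacted."""
    clean = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = scrub(value)  # recurse into nested fields
        elif isinstance(value, str):
            clean[key] = BEARER_RE.sub("Bearer [REDACTED]", value)
        else:
            clean[key] = value
    return clean
```

Key-based redaction only works because the logs are structured; free-text logs leave you with pattern matching alone, which is much easier to get wrong.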

Tradeoffs

Design choices you should be able to defend

OpenTelemetry + vendor backends

Pros

Vendor-neutral
Active CNCF project
Wide language support

Cons

Newer than Jaeger/Zipkin native SDKs
Some maturity gaps

Vendor SDK (Datadog, New Relic)

Pros

Tightest integration with their UI
Quickest to ship

Cons

Lock-in
Per-language coverage varies

Push (agent sends) vs pull (Prometheus scrapes) metrics

Pros

Push handles short-lived workloads
Pull works well for stable services

Cons

Push needs aggregation; pull needs service discovery

Alternatives

Other production approaches

OpenTelemetry + Grafana stack

Vendor-neutral; OSS; the modern default.

Datadog APM

Tightly integrated commercial stack; fastest to ship; vendor lock-in.

New Relic / Honeycomb

Honeycomb is the leader in event-driven, high-cardinality observability.

AWS X-Ray + CloudWatch

Native AWS choice; deep integration with AWS services.

ELK / EFK stack for logs

Elasticsearch-based; mature; operationally heavy at scale.

Think like an engineer

Questions to answer before shipping

Build observability before the first incident, not after. Retrofit costs are 10x.
For every alert, ask: what action does this trigger? If the answer is “none”, delete the alert.
SLO error-budget burn-rate alerts beat single-threshold alerts. Burn rate tells you how urgently to respond.
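Burn rate is the observed error ratio divided by the error budget ratio: a burn rate of 1 spends the budget exactly over the SLO window, and higher values spend it proportionally faster. The sketch below shows a two-window paging check. The 14.4 threshold is the commonly cited value for a 30-day window (it spends 2% of the budget in one hour); the function names and the choice of windows are assumptions for illustration.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo)


def should_page(long_window_ratio, short_window_ratio, slo, threshold=14.4):
    """Page only if both windows burn fast.

    The long window (e.g. 1 hour) shows the burn is significant; the short
    window (e.g. 5 minutes) shows it is still happening right now, so the
    alert clears soon after the problem stops.
    """
    return (
        burn_rate(long_window_ratio, slo) >= threshold
        and burn_rate(short_window_ratio, slo) >= threshold
    )
```

With a 99.9% SLO, a sustained 2% error ratio is a burn rate of about 20 and pages; the same 2% over the last hour with only 0.05% over the last five minutes does not, because the incident has already subsided.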

Key terms

Vocabulary used in this module

OpenTelemetry

CNCF project unifying metrics, logs, and traces under one vendor-neutral SDK.

Trace

Tree of spans representing a single request across services.

Span

Single unit of work in a trace (one HTTP call, one query, one handler).

SLO

Service Level Objective; a measurable commitment about latency, availability, etc.

Error budget

100% minus SLO; the failure allowance you can spend on risk.

Labs

Hands-on labs

60 minutes · Intermediate

Lab 9.1 - Trace a Request End-to-End

Instrument a 3-service chain with OTel; trace a request across all three; visualise in Jaeger.

Add OTel SDK to each service
Propagate W3C Trace Context across HTTP/gRPC calls
Send a request; find the trace in Jaeger
Identify the slowest span

90 minutes · Intermediate

Lab 9.2 - Build the Four Golden Signals

Add Prometheus instrumentation for latency, traffic, errors, saturation; build a Grafana dashboard.

Instrument requests with histogram + counter
Track connection pool saturation
Build a Grafana dashboard with all four signals
Define an SLO and visualise the burn rate

60 minutes · Intermediate

Lab 9.3 - Incident Triage from Telemetry

Inject a partial failure; use only the dashboards and traces to identify root cause.

Inject a 30% error rate at one downstream service
Use only Grafana + Jaeger to identify which service
Identify which user-facing endpoint is most affected
Document the runbook

Recap

Key takeaways

Observability has three pillars (metrics, logs, traces) plus a fourth (profiles); use them together
OpenTelemetry is the standard; one SDK, many backends
Distributed tracing answers latency questions metrics cannot - instrument every service
The four golden signals (latency, traffic, errors, saturation) are the minimum metric set per service
SLOs and error budgets convert observability into engineering discipline
