Module 9 of 12

Observability & Debugging

Distributed tracing, metrics, structured logging, correlation IDs, and the OpenTelemetry / Prometheus / Grafana / Jaeger stack that lets you debug systems you cannot SSH into.

4 hours · 3 labs · Free


Learning objectives

  • Instrument a service with OpenTelemetry traces, metrics, and logs
  • Correlate a single request across many services via trace IDs
  • Build the four golden signals (latency, traffic, errors, saturation) in Prometheus
  • Read a distributed trace and identify where latency accrues
  • Build the runbook a 3am on-call engineer actually uses

Before

  • Print-statement debugging across distributed services; root-cause takes hours
  • Per-service dashboards in different tools; correlation is manual
  • Alert fatigue from threshold-based alerts that fire on noise
  • No SLOs; everyone has different definitions of “working”

After

  • Distributed traces correlate the full request across services
  • Unified Grafana on top of Prometheus + Tempo + Loki; single pane of glass
  • SLO + error-budget burn-rate alerts; signal over noise
  • Documented SLOs aligned to user experience; product decisions tied to error budget

Figure: Distributed tracing pipeline. Services A, B, and C (each with the OTel SDK) → OTel Collector (batches, samples, fans out to backends) → Jaeger / Tempo (traces), Prometheus (metrics), Loki / ES (logs) → Grafana (unified UI).

Observability is what separates a system you can debug from one you cannot. In a monolith, debugging means reading a stack trace. In a distributed system, debugging means reconstructing a request across many services from telemetry alone. If your observability is poor, incidents drag on because nobody can see where the request failed. If it is good, the system tells you where it broke.

The Three Pillars (and the Fourth)

  • Metrics: numeric time-series. Cheap to store, cheap to query, great for alerting and dashboards. Prometheus is the open-source standard.
  • Logs: discrete events with timestamps and structure. Expensive to store at scale; great for debugging known issues.
  • Traces: per-request causality across services. Heavy to capture and store; essential for debugging latency.
  • Profiles (the modern fourth): CPU and allocation samples per service over time (Pyroscope, Parca). Catches what metrics aggregate away.

OpenTelemetry — The Standard

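OpenTelemetry (OTel) is the CNCF project that unifies instrumentation. One SDK in your application emits all three signals; an OTel Collector batches, samples, and ships them to whichever backends you choose. Because the instrumentation is vendor-neutral, switching backends becomes a Collector configuration change rather than a re-instrumentation project, which removes most of the lock-in of older instrumentation libraries.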

Production pattern: every service uses the OTel SDK; an OTel Collector runs as a DaemonSet on Kubernetes; the Collector forwards traces to Jaeger/Tempo, metrics to Prometheus, logs to Loki or Elasticsearch. Grafana is the unified UI on top.

Distributed Tracing

A trace is a tree of spans, where each span represents a unit of work (an HTTP call, a database query, a message handler). The root span is the user request; child spans are everything it triggered. Every span carries the trace ID (shared by the whole trace and propagated across service boundaries via the W3C Trace Context traceparent header), its own span ID, and the span ID of its parent, which is what links the spans into a tree.
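To make the propagation step concrete, here is a minimal sketch of building and parsing a W3C Trace Context traceparent header by hand. The function names are hypothetical; in a real service the OTel SDK's propagators do this for you. The sketch only handles the four-field layout (version, trace ID, parent span ID, flags).

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: fresh 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    match = _TRACEPARENT.match(header.strip().lower())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace ID, new span ID, same flags."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return new_traceparent()  # unusable context: start a new trace
    trace_id, _parent_span_id, sampled = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{'01' if sampled else '00'}"
```

A service receiving a request would call parse_traceparent on the incoming header, record the parent span ID on its own span, and attach child_traceparent(...) to every downstream call so the trace ID stays constant end to end.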

Tracing answers questions metrics cannot: why is p99 of request type X high? A trace shows you exactly which downstream service contributed the latency. Why does this rare error happen? The trace shows the full causal chain. Where is this request actually going? The trace reveals architectural surprises (services calling services you forgot existed).

Sampling

Tracing every request is expensive at scale. Sampling reduces volume:

  • Head-based sampling: decide at the start of the trace whether to keep it (e.g. 1% of all requests). Simple, but the decision is made before the outcome is known, so at 1% it also drops about 99% of error traces.
  • Tail-based sampling: collect everything, decide after the trace completes (e.g. keep all error traces and a sample of success traces). Better visibility; harder to operate.
  • Adaptive / hybrid: head-sample at modest rate; force-sample known interesting paths (errors, slow requests, specific endpoints).
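The difference between the strategies above can be sketched in a few lines. This is an illustration under assumptions, not how any particular collector implements sampling: the span dictionaries, the slow_ms threshold, and both function names are hypothetical.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide from the trace ID alone, before the outcome is known.

    Reading the low 8 bytes of the ID as an integer gives a stable bucket, so
    every service makes the same keep/drop decision for the same trace.
    """
    bucket = int(trace_id[-16:], 16)
    return bucket < int(rate * (1 << 64))


def tail_sample(spans: list[dict], baseline_rate: float, slow_ms: float = 1000.0) -> bool:
    """Tail-based: decide after the trace completes, with full knowledge."""
    if any(span.get("error") for span in spans):
        return True  # keep every error trace
    if any(span["duration_ms"] >= slow_ms for span in spans):
        return True  # keep every slow trace
    # Otherwise keep only a baseline share of healthy traces.
    return head_sample(spans[0]["trace_id"], baseline_rate)
```

The operational cost of tail-based sampling is visible in the signature: tail_sample needs every span of the trace in one place before it can decide, which means buffering complete traces somewhere.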

The Four Golden Signals

From the Google SRE book, the four metrics every service should emit:

  1. Latency: time to serve a request. Track p50, p95, p99; alert on p99.
  2. Traffic: requests per second. Sudden change is a signal even if everything else looks fine.
  3. Errors: failed requests. Express as a rate (errors per second) or a ratio (failed requests divided by total requests).
  4. Saturation: how full the system is. CPU, memory, queue depth, connection pool utilisation. Saturation precedes other failures.

Every dashboard, every alert, every SLO connects back to these four. If you only emit four metrics per service, emit these.
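As a rough illustration of what each signal measures, the sketch below computes all four from a window of raw request records. In production these would come from Prometheus histograms and counters rather than in-memory lists; the Request type, the 5xx error rule, and the connection-pool saturation measure are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100], values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests: list[Request], window_s: float,
                   in_use: int, pool_size: int) -> dict:
    """Summarise one non-empty window of requests into the four signals."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p99_ms": percentile(durations, 99),
        "traffic_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "saturation": in_use / pool_size,  # e.g. connection pool utilisation
    }
```

Saturation is the only signal that does not come from the request stream itself; it is read from the resource (pool, queue, CPU) that the requests compete for.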

Structured Logging and Correlation IDs

Logs are useful when they are queryable. That means structured (JSON or key=value) and correlated. Every log line should include the trace ID so you can filter by request and the user/tenant ID so you can debug per-user issues.

The minimum log line for a service: timestamp, level, service, trace_id, span_id, user_id, message, ...fields. Anything less and your logs are unsearchable at scale.
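One way to produce that minimum log line with Python's standard logging module is sketched below. JsonFormatter and build_logger are hypothetical helpers, and the sketch assumes the caller passes trace_id, span_id, and user_id through the extra argument; with the OTel SDK the trace context would normally be injected automatically.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    CONTEXT_FIELDS = ("trace_id", "span_id", "user_id")

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Logger that writes JSON lines to `stream` (stderr when None)."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage would look like build_logger("checkout").info("payment declined", extra={"trace_id": ..., "span_id": ..., "user_id": "u-123"}). Missing context fields are emitted as null rather than dropped, so every line has the same shape and queries never fail on an absent key.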

SLO and Error Budget

Service Level Objectives (SLOs) translate the four golden signals into commitments. “p99 latency < 300ms over 30 days” or “99.9% of requests return 2xx/3xx over 30 days”. The error budget is the difference between 100% and the SLO — the amount of failure you have permission to spend on risky changes, deploys, or experiments.

Operating with explicit SLOs and error budgets is the discipline of the modern SRE function. The pattern: when error budget is healthy, you ship features fast; when it is exhausted, you stop shipping and stabilise.
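The arithmetic behind an error budget is small enough to sketch directly. The function names and the simple "ship while any budget remains" rule are assumptions for illustration; real policies usually add thresholds and review steps.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Full-outage minutes the SLO tolerates over the window (slo=0.999 for 99.9%)."""
    return (1 - slo) * window_days * 24 * 60


def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Budget for a request-based SLO over one window."""
    allowed_failures = (1 - slo) * total_requests
    spent = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "spent": spent,          # 1.0 means the budget is fully used
        "remaining": 1 - spent,  # negative means the SLO is already breached
    }


def can_ship(budget: dict) -> bool:
    """Ship features while budget remains; stabilise once it is exhausted."""
    return budget["remaining"] > 0
```

For a 99.9% SLO over 30 days, allowed_downtime_minutes(0.999) comes to roughly 43 minutes, and a service that handled 10 million requests may fail about 10,000 of them before the budget is gone.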

Debugging Distributed Systems

The flow that works in practice:

  1. Alert fires ⇒ identify the affected service from dashboard.
  2. Check the four golden signals for that service.
  3. Pick a representative failed trace; walk it span by span; identify where latency or error appears.
  4. If it is a downstream service, recurse: open that service's dashboard, repeat.
  5. If it is in the service itself, jump to logs filtered by that trace ID.
  6. If logs do not show the cause, jump to profiles.

That's the loop. Every minute saved in this loop is a minute off MTTR.

Distributed Request Trace Timeline

Figure: Distributed trace, span timeline (0ms to 200ms)

  • root: GET /api/checkout (200ms)
  • auth (25ms)
  • cart-service (60ms)
  • db.query (35ms)
  • payment-svc (85ms) ← slowest
  • stripe-api (70ms)
  • log (12ms)

payment-svc → stripe-api dominates p99 latency. Optimisation target identified at a glance.
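Walking a trace "span by span" mostly means asking where time is spent that is not explained by child spans. The sketch below does that for the timeline above. The durations come from the figure; the parent links are assumptions (the flattened figure does not show them), and the calculation assumes sibling spans run one after another rather than in parallel.

```python
from collections import defaultdict

# (name, parent, duration_ms). Parent links are assumed for illustration.
SPANS = [
    ("GET /api/checkout", None, 200),
    ("auth", "GET /api/checkout", 25),
    ("cart-service", "GET /api/checkout", 60),
    ("db.query", "cart-service", 35),
    ("payment-svc", "GET /api/checkout", 85),
    ("stripe-api", "payment-svc", 70),
    ("log", "GET /api/checkout", 12),
]


def self_times(spans):
    """Self time = span duration minus time spent in its direct children."""
    child_total = defaultdict(int)
    for _name, parent, duration in spans:
        if parent is not None:
            child_total[parent] += duration
    return {name: duration - child_total[name] for name, _parent, duration in spans}


def hottest_span(spans):
    """Name of the span with the largest self time."""
    times = self_times(spans)
    return max(times, key=times.get)
```

Under these assumptions payment-svc itself accounts for only 15ms of its 85ms; the remaining 70ms is the stripe-api call, which is why the external dependency rather than the service's own code is the optimisation target.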

Self-Check Quiz

  1. Why are metrics, logs, and traces complementary rather than redundant? (Answer: metrics tell you something is wrong; traces tell you which service; logs tell you why. Each answers a different question at a different cost.)
  2. You have head-based sampling at 1%. Errors get under-represented. What is the fix? (Answer: tail-based sampling — sample after the trace completes, force-include error traces. Or hybrid head+force-on-error.)
  3. Your dashboard shows error rate at 0.1%, within SLO. Customer support says many users complain. What is happening? (Answer: service-wide aggregates hide concentration. Your 0.1% may be concentrated on one tenant or one feature. Slice metrics by user/tenant/feature, not just service-level.)
  4. What are the four golden signals and why does saturation matter? (Answer: latency, traffic, errors, saturation. Saturation precedes the other three failing — the queue fills before latency climbs before errors fire.)
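Question 3 is easier to believe with numbers. The following sketch groups request records by an arbitrary label and computes the error ratio per group; the record layout and the error_ratio_by helper are hypothetical.

```python
from collections import Counter


def error_ratio_by(records: list[dict], key: str) -> dict:
    """Error ratio (5xx responses / total) per distinct value of `key`."""
    totals, errors = Counter(), Counter()
    for record in records:
        totals[record[key]] += 1
        if record["status"] >= 500:
            errors[record[key]] += 1
    return {label: errors[label] / totals[label] for label in totals}


# 10,000 requests overall, 10 failures: a 0.1% service-wide error ratio.
records = (
    [{"tenant": "acme", "status": 200}] * 9900
    + [{"tenant": "globex", "status": 200}] * 90
    + [{"tenant": "globex", "status": 500}] * 10
)
per_tenant = error_ratio_by(records, "tenant")
```

Here the service-wide ratio is 0.1%, but per_tenant shows one tenant at 10% and the other at 0%. The same slicing in Prometheus means adding a tenant or feature label, with the usual caution about label cardinality.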

For runtime-detection observability the Runtime Security cheatsheet covers Falco/Tetragon eBPF telemetry alongside the application-layer signals.

Real world

Where this shows up

  • Netflix runs distributed tracing across thousands of services with sampled tail-based collection.
  • Google's Dapper paper (2010) is the foundation of modern distributed tracing.
  • Cloudflare uses Honeycomb (event-driven observability) for high-cardinality investigation.
  • Uber built Jaeger (now CNCF) to handle their tracing volume; donated it to the community.

Production notes

Keep these close

  • Use OpenTelemetry; one SDK, many backends. Vendor-specific SDKs are a future migration cost.
  • Sample tail-based, not head-based, for error visibility. Or use head-based at modest rate plus force-sample on errors.
  • Define SLOs that map to user experience, not system health. p99 latency on the checkout flow matters; p99 on the health-check endpoint does not.
  • Tag every log with trace_id and user_id. Without correlation, logs at scale are unsearchable.

Common mistakes

What usually breaks

  • Tracing the easy services first. The services without tracing become invisible, and they are usually the legacy ones causing the incidents.
  • Alerting on anything that wiggles. Alert fatigue is a category of incident on its own.
  • Tracking averages instead of percentiles. Means hide tail behaviour; p99 is the truth.
  • Burning through error budget without slowing down. The whole point is to slow shipping when the budget is exhausted.

Security risks

Threats to watch

  • Logs containing tokens, passwords, or PII become a parallel data-exfiltration target. Apply structured-log scrubbing at the collector.
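  • OpenTelemetry Collectors can expose internal trace data if debug extensions (such as zPages or pprof) are enabled or receivers listen on all interfaces. Bind them to localhost or restrict access with network policy.
  • Trace context (the W3C traceparent and tracestate headers, plus any baggage) propagates across trust boundaries; strip or regenerate it before calling third parties, and do not blindly trust context arriving from outside.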
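For the first threat above, the scrubbing logic itself is simple; the hard part is running it on every path logs can take. The sketch below shows the idea in Python for illustration only. The key list and the two patterns are assumptions and far from exhaustive, and in the pipeline described in this module the equivalent rules would live in the collector's processing stage rather than in application code.

```python
import re

# Illustrative patterns only; a real deployment needs a reviewed, broader set.
_PATTERNS = [
    (re.compile(r"(?i)\b(bearer)\s+[a-z0-9._~+/-]+=*"), r"\1 [REDACTED]"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[REDACTED_EMAIL]"),
]
_SENSITIVE_KEYS = {"password", "passwd", "secret", "token", "authorization", "api_key"}


def scrub(value):
    """Recursively redact sensitive keys and secret-looking substrings."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in _SENSITIVE_KEYS else scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, str):
        for pattern, replacement in _PATTERNS:
            value = pattern.sub(replacement, value)
    return value
```

Key-based redaction catches fields that are known to be sensitive; the pattern pass is a second net for secrets that end up inside free-text messages.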

Tradeoffs

Design choices you should be able to defend

OpenTelemetry + vendor backends

Pros

  • Vendor-neutral
  • Active CNCF project
  • Wide language support

Cons

  • Newer than Jaeger/Zipkin native SDKs
  • Some maturity gaps

Vendor SDK (Datadog, New Relic)

Pros

  • Tightest integration with their UI
  • Quickest to ship

Cons

  • Lock-in
  • Per-language coverage varies

Push (agent sends) vs pull (Prometheus scrapes) metrics

Pros

  • Push handles short-lived workloads
  • Pull works well for stable services

Cons

  • Push needs aggregation; pull needs service discovery

Alternatives

Other production approaches

OpenTelemetry + Grafana stack

Vendor-neutral; OSS; the modern default.

Datadog APM

Tightly integrated commercial stack; fastest to ship; vendor lock-in.

New Relic / Honeycomb

Honeycomb is the leader in event-driven, high-cardinality observability.

AWS X-Ray + CloudWatch

Native AWS choice; deep integration with AWS services.

ELK / EFK stack for logs

Elasticsearch-based; mature; operationally heavy at scale.

Think like an engineer

Questions to answer before shipping

  • Build observability before the first incident, not after. Retrofitting under incident pressure costs far more than instrumenting up front.
  • For every alert, ask: what action does this trigger? If the answer is “none”, delete the alert.
  • SLO error-budget burn-rate alerts beat single-threshold alerts. Burn rate tells you how urgently to respond.
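A burn-rate alert can be sketched as follows. Burn rate is the observed error ratio divided by the ratio the SLO allows, so 1.0 means the budget lasts exactly the SLO window. The 14.4 threshold is the commonly cited paging value for a 30-day window (spending 2% of the budget in one hour: 0.02 × 720 h / 1 h); the function names and the two-window rule as written here are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1 - slo)


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast.

    The long window (e.g. 1 hour) shows the burn is significant; the short
    window (e.g. 5 minutes) shows it is still happening right now.
    """
    return (
        burn_rate(long_window_ratio, slo) >= threshold
        and burn_rate(short_window_ratio, slo) >= threshold
    )
```

With a 99.9% SLO, a sustained 2% error ratio is a burn rate of about 20 and pages; the same hourly figure with a recovered 5-minute window does not, which is how burn-rate alerts avoid firing on incidents that have already ended.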

Key terms

Vocabulary used in this module

OpenTelemetry

CNCF project unifying metrics, logs, and traces under one vendor-neutral SDK.

Trace

Tree of spans representing a single request across services.

Span

Single unit of work in a trace (one HTTP call, one query, one handler).

SLO

Service Level Objective; a measurable commitment about latency, availability, etc.

Error budget

100% minus SLO; the failure allowance you can spend on risk.

Labs

Hands-on labs

60 minutes · Intermediate

Lab 9.1 — Trace a Request End-to-End

Instrument a 3-service chain with OTel; trace a request across all three; visualise in Jaeger.

  1. Add OTel SDK to each service
  2. Propagate W3C Trace Context across HTTP/gRPC calls
  3. Send a request; find the trace in Jaeger
  4. Identify the slowest span

90 minutes · Intermediate

Lab 9.2 — Build the Four Golden Signals

Add Prometheus instrumentation for latency, traffic, errors, saturation; build a Grafana dashboard.

  1. Instrument requests with histogram + counter
  2. Track connection pool saturation
  3. Build a Grafana dashboard with all four signals
  4. Define an SLO and visualise the burn rate

60 minutes · Intermediate

Lab 9.3 — Incident Triage from Telemetry

Inject a partial failure; use only the dashboards and traces to identify root cause.

  1. Inject a 30% error rate at one downstream service
  2. Use only Grafana + Jaeger to identify which service
  3. Identify which user-facing endpoint is most affected
  4. Document the runbook

Recap

Key takeaways

  • Observability has three pillars (metrics, logs, traces) plus a fourth (profiles); use them together
  • OpenTelemetry is the standard; one SDK, many backends
  • Distributed tracing answers latency questions metrics cannot; instrument every service
  • The four golden signals (latency, traffic, errors, saturation) are the minimum metric set per service
  • SLOs and error budgets convert observability into engineering discipline
