
Module 9: Observability & Debugging

Distributed tracing, metrics, structured logging, correlation IDs, and the OpenTelemetry / Prometheus / Grafana / Jaeger stack that lets you debug systems you cannot SSH into.

4 hours. 3 hands-on labs. Free course module.

Learning Objectives

  • Instrument a service with OpenTelemetry traces, metrics, and logs
  • Correlate a single request across many services via trace IDs
  • Build the four golden signals (latency, traffic, errors, saturation) in Prometheus
  • Read a distributed trace and identify where latency accrues
  • Build the runbook a 3am on-call engineer actually uses

Why This Matters

Observability is the difference between debuggable and undebuggable systems. A distributed system you cannot trace is a distributed system you cannot operate at scale. Engineers who build observability in from the start have meaningfully shorter MTTR; engineers who bolt it on after the first incident spend years catching up. SLOs and error budgets convert observability into engineering discipline that aligns product velocity with reliability.

Distributed tracing pipeline: Service A, Service B, and Service C (each with the OTel SDK) → OTel Collector (batches, samples, fans out to backends) → Jaeger / Tempo (traces), Prometheus (metrics), Loki / ES (logs) → Grafana (unified UI).
Architecture diagram for Module 9: Observability & Debugging.

Lesson Content

Observability is what separates a system you can debug from one you cannot. In a monolith, debugging means reading a stack trace. In a distributed system, debugging means reconstructing a request across many services from telemetry alone. If your observability is poor, your incidents are unrecoverable. If it is good, the system tells you where it broke.

The Three Pillars (and the Fourth)

  • Metrics: numeric time-series. Cheap to store, cheap to query, great for alerting and dashboards. Prometheus is the open-source standard.
  • Logs: discrete events with timestamps and structure. Expensive to store at scale; great for debugging known issues.
  • Traces: per-request causality across services. Heavy to capture and store; essential for debugging latency.
  • Profiles (the modern fourth): CPU and allocation samples per service over time (Pyroscope, Parca). Catches what metrics aggregate away.

OpenTelemetry — The Standard

OpenTelemetry (OTel) is the CNCF project that unifies instrumentation. One SDK in your application emits all three signals; an OTel Collector batches, samples, and ships them to whichever backends you choose. Because the instrumentation no longer depends on the backend, the vendor lock-in of older instrumentation libraries largely goes away.

Production pattern: every service uses the OTel SDK; an OTel Collector runs as a DaemonSet on Kubernetes; the Collector forwards traces to Jaeger/Tempo, metrics to Prometheus, logs to Loki or Elasticsearch. Grafana is the unified UI on top.

Distributed Tracing

A trace is a tree of spans, where each span represents a unit of work (an HTTP call, a database query, a message handler). The root span is the user request; child spans are everything it triggered. Every span carries the trace ID, which is shared by all spans in the request and propagated across service boundaries in the W3C Trace Context traceparent header, plus its own span ID and the span ID of its parent.
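To make the propagation step concrete, here is a minimal standard-library sketch of the traceparent header format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). In a real service the OTel SDK's propagators do this for you; the function names below are hypothetical and exist only for illustration, and the sketch accepts only version 00.

```python
import re
import secrets
from typing import Optional

# traceparent layout: version-traceid-parentid-flags, lowercase hex.
# Simplification: this sketch accepts only version "00".
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled: bool = True) -> str:
    """Start a trace: fresh 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header's fields, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace ID, this service's new span ID."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()  # no valid context: start a new trace
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"
```

Each hop keeps the trace ID and replaces the parent span ID with its own, which is what lets the backend reassemble the tree.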

Tracing answers questions metrics cannot: why is p99 of request type X high? A trace shows you exactly which downstream service contributed the latency. Why does this rare error happen? The trace shows the full causal chain. Where is this request actually going? The trace reveals architectural surprises (services calling services you forgot existed).

Sampling

Tracing every request is expensive at scale. Sampling reduces volume:

  • Head-based sampling: decide at the start of the trace whether to keep it (e.g. 1% of all requests). Simple, but the decision is made before the outcome is known, so at 1% it also drops about 99% of error traces.
  • Tail-based sampling: collect everything, decide after the trace completes (e.g. keep all error traces and a sample of success traces). Better visibility; harder to operate.
  • Adaptive / hybrid: head-sample at modest rate; force-sample known interesting paths (errors, slow requests, specific endpoints).
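The three strategies above can be sketched in a few lines. This is an illustrative model, not the OTel Collector's sampler: the FinishedTrace type, the thresholds, and the function names are assumptions made for the example. The head decision is derived from the trace ID so that every service reaches the same verdict without coordination.

```python
from dataclasses import dataclass


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: keep the trace if its ID falls in the lowest `rate`
    fraction of the ID space. Deterministic per trace ID."""
    bucket = int(trace_id[-16:], 16) / 2**64  # low 8 bytes mapped to [0, 1]
    return bucket < rate


@dataclass
class FinishedTrace:
    trace_id: str
    duration_ms: float
    has_error: bool


def tail_sample(trace: FinishedTrace,
                slow_ms: float = 1000.0,
                baseline_rate: float = 0.01) -> bool:
    """Tail-based: decide after completion. Keep every error and every
    slow trace, plus a small baseline of healthy ones."""
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    return head_sample(trace.trace_id, baseline_rate)
```

The cost of the tail-based path is that every span must be buffered until the trace completes, which is why it is harder to operate.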

The Four Golden Signals

From the Google SRE book, the four metrics every service should emit:

  1. Latency: time to serve a request. Track p50, p95, p99; alert on p99.
  2. Traffic: requests per second. Sudden change is a signal even if everything else looks fine.
  3. Errors: failed requests. Express as a rate (errors per second) or a ratio (failed requests divided by total requests).
  4. Saturation: how full the system is. CPU, memory, queue depth, connection pool utilisation. Saturation precedes other failures.

Every dashboard, every alert, every SLO connects back to these four. If you only emit four metrics per service, emit these.
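As a rough sketch of what the four signals reduce to, the following computes them from one window of raw request records. In production Prometheus histograms and counters would do this; the Request type, the nearest-rank percentile, and the choice of connection-pool utilisation as the saturation measure are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p/100 * n) in integer math
    return ordered[rank - 1]


def golden_signals(requests: Sequence[Request], window_s: float,
                   pool_in_use: int, pool_size: int) -> dict:
    """Latency, traffic, errors, and saturation for one time window."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_ms": {f"p{p}": percentile(durations, p) for p in (50, 95, 99)},
        "mean_ms": sum(durations) / len(durations),
        "traffic_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "saturation": pool_in_use / pool_size,
    }
```

With 98 requests at 20ms and 2 failing at 900ms, the mean is 37.6ms while p99 is 900ms, which is the gap that percentile tracking exists to expose.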

Structured Logging and Correlation IDs

Logs are useful when they are queryable. That means structured (JSON or key=value) and correlated. Every log line should include the trace ID so you can filter by request and the user/tenant ID so you can debug per-user issues.

The minimum log line for a service: timestamp, level, service, trace_id, span_id, user_id, message, ...fields. Anything less and your logs are unsearchable at scale.
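A minimal sketch of that log line using only Python's standard logging module. The JsonFormatter and build_logger names are hypothetical, and passing the IDs through the extra argument is one possible approach; with the OTel SDK, the trace and span IDs would normally be injected from the active span context.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with correlation fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # Values supplied via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "user_id": getattr(record, "user_id", None),
            "message": record.getMessage(),
        }
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str, stream) -> logging.Logger:
    """Logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger
```

A call such as logger.info("payment declined", extra={"trace_id": ..., "span_id": ..., "user_id": ..., "fields": {"amount_cents": 1299}}) then produces a line that can be filtered by trace_id or user_id in Loki or Elasticsearch.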

SLO and Error Budget

Service Level Objectives (SLOs) translate the four golden signals into commitments. “p99 latency < 300ms over 30 days” or “99.9% of requests return 2xx/3xx over 30 days”. The error budget is the difference between 100% and the SLO — the amount of failure you have permission to spend on risky changes, deploys, or experiments.

Operating with explicit SLOs and error budgets is the discipline of the modern SRE function. The pattern: when error budget is healthy, you ship features fast; when it is exhausted, you stop shipping and stabilise.
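The arithmetic behind an error budget is short enough to write down. The sketch below uses hypothetical function names and treats the SLO as a fraction (0.999 for 99.9%).

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def allowed_failures(slo: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates."""
    return (1 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = allowed_failures(slo, total_requests)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1 - failed_requests / budget
```

For a 99.9% SLO over 30 days this works out to about 43.2 minutes of total outage, or 1,000 failed requests per million. A service that has failed 400 of its first million requests has roughly 60% of its budget left.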

Debugging Distributed Systems

The flow that works in practice:

  1. Alert fires ⇒ identify the affected service from the dashboard.
  2. Check the four golden signals for that service.
  3. Pick a representative failed trace; walk it span by span; identify where latency or error appears.
  4. If it is a downstream service, recurse: open that service's dashboard, repeat.
  5. If it is in the service itself, jump to logs filtered by that trace ID.
  6. If logs do not show the cause, jump to profiles.

That's the loop. Every minute saved in this loop is a minute off MTTR.

Distributed Request Trace Timeline

Distributed trace, span timeline (0ms to 200ms):

  • root: GET /api/checkout (200ms)
  • auth (25ms)
  • cart-service (60ms)
  • db.query (35ms)
  • payment-svc (85ms) ← slowest
  • stripe-api (70ms)
  • log (12ms)

payment-svc → stripe-api dominates p99 latency. Optimisation target identified at a glance.
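Walking a trace span by span amounts to computing each span's self time: its duration minus the time spent in its children. The sketch below applies that to the timeline above. The parent assignments (db.query under cart-service, stripe-api under payment-svc, everything else under the root) are an assumption, since the timeline gives durations but not the hierarchy, and the calculation assumes children run sequentially rather than concurrently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    name: str
    duration_ms: float
    parent: Optional[str]  # assumed hierarchy, see the note above


ROOT = "GET /api/checkout"
SPANS: List[Span] = [
    Span(ROOT, 200, None),
    Span("auth", 25, ROOT),
    Span("cart-service", 60, ROOT),
    Span("db.query", 35, "cart-service"),
    Span("payment-svc", 85, ROOT),
    Span("stripe-api", 70, "payment-svc"),
    Span("log", 12, ROOT),
]


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Duration of each span minus the summed duration of its children."""
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent is not None:
            child_total[span.parent] = (
                child_total.get(span.parent, 0) + span.duration_ms)
    return {s.name: s.duration_ms - child_total.get(s.name, 0) for s in spans}


def hottest_span(spans: List[Span]) -> str:
    """Name of the span where the most time is actually spent."""
    times = self_times(spans)
    return max(times, key=times.get)
```

Under these assumptions payment-svc itself accounts for only 15ms; the remaining 70ms is the stripe-api call, so that is where the recursion in step 4 would go next.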

Self-Check Quiz

  1. Why are metrics, logs, and traces complementary rather than redundant? (Answer: metrics tell you something is wrong; traces tell you which service; logs tell you why. Each answers a different question at a different cost.)
  2. You have head-based sampling at 1%. Errors get under-represented. What is the fix? (Answer: tail-based sampling — sample after the trace completes, force-include error traces. Or hybrid head+force-on-error.)
  3. Your dashboard shows error rate at 0.1%, within SLO. Customer support says many users complain. What is happening? (Answer: service-wide aggregates hide concentrated failures. Your 0.1% may be concentrated on one tenant or one feature. Slice metrics by user/tenant/feature, not just at service level.)
  4. What are the four golden signals and why does saturation matter? (Answer: latency, traffic, errors, saturation. Saturation is the leading indicator: the queue fills before latency climbs, and latency climbs before errors fire.)
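Question 3 is easier to believe with numbers. The sketch below slices an error ratio by an arbitrary label; the record shape and the tenant names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable


def error_ratio(records: Iterable[dict]) -> float:
    """Service-wide ratio of 5xx responses to all responses."""
    records = list(records)
    failed = sum(1 for r in records if r["status"] >= 500)
    return failed / len(records)


def error_ratio_by(records: Iterable[dict], key: str) -> Dict[str, float]:
    """The same ratio, computed separately for each value of `key`."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        if r["status"] >= 500:
            failures[r[key]] += 1
    return {k: failures[k] / totals[k] for k in totals}


# Hypothetical traffic: one small tenant fails half the time.
RECORDS = (
    [{"tenant": "acme", "status": 500}] * 10
    + [{"tenant": "acme", "status": 200}] * 10
    + [{"tenant": "other", "status": 200}] * 9980
)
# error_ratio(RECORDS) is 0.001, i.e. 0.1% and within SLO,
# while error_ratio_by(RECORDS, "tenant")["acme"] is 0.5.
```

The trade-off is cardinality: per-tenant or per-user labels multiply the number of time series, so high-cardinality slicing is often done in traces or events rather than in Prometheus metrics.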

For runtime-detection observability the Runtime Security cheatsheet covers Falco/Tetragon eBPF telemetry alongside the application-layer signals.

Real-World Use Cases

  • Netflix runs distributed tracing across thousands of services with tail-based sampling.
  • Google's Dapper paper (2010) is the foundation of modern distributed tracing.
  • Cloudflare uses Honeycomb (event-driven observability) for high-cardinality investigation.
  • Uber built Jaeger (now CNCF) to handle their tracing volume; donated it to the community.

Production Notes

  • Use OpenTelemetry; one SDK, many backends. Vendor-specific SDKs are a future migration cost.
  • Sample tail-based, not head-based, for error visibility. Or use head-based at modest rate plus force-sample on errors.
  • Define SLOs that map to user experience, not system health. p99 latency on the checkout flow matters; p99 on the health-check endpoint does not.
  • Tag every log with trace_id and user_id. Without correlation, logs at scale are unsearchable.

Common Mistakes

  • Tracing the easy services first. The services without tracing become invisible, and they are usually the legacy ones causing the incidents.
  • Alerting on anything that wiggles. Alert fatigue is a category of incident on its own.
  • Tracking averages instead of percentiles. Means hide tail behaviour; percentiles such as p99 expose it.
  • Burning through error budget without slowing down. The whole point is to slow shipping when the budget is exhausted.

Security Risks to Watch

  • Logs containing tokens, passwords, or PII become a parallel data-exfiltration target. Apply structured-log scrubbing at the collector.
  • OpenTelemetry Collector debug extensions such as zPages and pprof can expose internal trace and runtime data if their endpoints are reachable. Disable them or lock them down in production.
  • Trace context (W3C Trace Context header) propagates across trust boundaries; sanitise before forwarding to third parties.
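The scrubbing rule in the first bullet can be sketched as a recursive redaction pass over a structured log entry. In practice this logic would usually live in the collector pipeline; it is shown here in Python for illustration. The key list and the two regular expressions are assumptions and are far from exhaustive.

```python
import re
from typing import Any

# Illustrative only: a real deployment needs a reviewed, broader list.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key", "secret"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER_RE = re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+")


def scrub(value: Any) -> Any:
    """Return a copy of a log entry with secrets and emails redacted."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SENSITIVE_KEYS else scrub(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        value = BEARER_RE.sub("Bearer [REDACTED]", value)
        return EMAIL_RE.sub("[EMAIL]", value)
    return value  # numbers, booleans, None pass through unchanged
```

Key-based redaction catches fields that are named honestly; the pattern-based pass is a backstop for secrets that end up inside free-text messages.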

Design Tradeoffs

OpenTelemetry + vendor backends

Pros

  • Vendor-neutral
  • Active CNCF project
  • Wide language support

Cons

  • Newer than Jaeger/Zipkin native SDKs
  • Some maturity gaps

Vendor SDK (Datadog, New Relic)

Pros

  • Tightest integration with their UI
  • Quickest to ship

Cons

  • Lock-in
  • Per-language coverage varies

Push (agent sends) vs pull (Prometheus scrapes) metrics

Pros

  • Push handles short-lived workloads
  • Pull works well for stable services

Cons

  • Push needs aggregation; pull needs service discovery

Production Alternatives

  • OpenTelemetry + Grafana stack: Vendor-neutral; OSS; the modern default.
  • Datadog APM: Tightly integrated commercial stack; fastest to ship; vendor lock-in.
  • New Relic / Honeycomb: Commercial SaaS platforms; Honeycomb specialises in event-driven, high-cardinality observability.
  • AWS X-Ray + CloudWatch: Native AWS choice; deep integration with AWS services.
  • ELK / EFK stack for logs: Elasticsearch-based; mature; operationally heavy at scale.

Think Like an Engineer

  • Build observability before the first incident, not after. As a rule of thumb, retrofitting costs roughly 10x as much.
  • For every alert, ask: what action does this trigger? If the answer is “none”, delete the alert.
  • SLO error-budget burn-rate alerts beat single-threshold alerts. Burn rate tells you how urgently to respond.
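Burn rate is the observed error ratio divided by the ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below pages only when both a long and a short window exceed the threshold. The 14.4 default follows the commonly cited multi-window example (about 2% of a 30-day budget spent in one hour); treat it and the function names as assumptions to tune.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1 - slo)


def budget_consumed(rate: float, window_hours: float,
                    slo_window_days: int = 30) -> float:
    """Fraction of the whole error budget spent at `rate` for `window_hours`."""
    return rate * window_hours / (slo_window_days * 24)


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only if the burn is both sustained (long window, e.g. 1h)
    and still happening (short window, e.g. 5m)."""
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)
```

The short window is what stops the alert from continuing to fire after the problem has already been fixed, which a single-threshold alert on the long window cannot do.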

Production Story

A team operated a 12-service architecture for two years without distributed tracing. Every incident took hours of cross-team Slack to root-cause: “Did A call B? Did B call C? Where did the latency happen?” After OpenTelemetry rollout, MTTR dropped from 90 minutes to 12 minutes. The next major incident was triaged in 8 minutes because the engineer could see exactly which downstream service contributed the latency. The investment was 2 weeks of platform work; the payback was permanent.

Key Terms

OpenTelemetry
CNCF project unifying metrics, logs, and traces under one vendor-neutral SDK.
Trace
Tree of spans representing a single request across services.
Span
Single unit of work in a trace (one HTTP call, one query, one handler).
SLO
Service Level Objective; a measurable commitment about latency, availability, etc.
Error budget
100% minus SLO; the failure allowance you can spend on risk.

Hands-On Labs

  1. Lab 9.1 — Trace a Request End-to-End

    Instrument a 3-service chain with OTel; trace a request across all three; visualise in Jaeger.

    60 minutes - Intermediate

    • Add OTel SDK to each service
    • Propagate W3C Trace Context across HTTP/gRPC calls
    • Send a request; find the trace in Jaeger
    • Identify the slowest span

    View lab files on GitHub

  2. Lab 9.2 — Build the Four Golden Signals

    Add Prometheus instrumentation for latency, traffic, errors, saturation; build a Grafana dashboard.

    90 minutes - Intermediate

    • Instrument requests with histogram + counter
    • Track connection pool saturation
    • Build a Grafana dashboard with all four signals
    • Define an SLO and visualise the burn rate

    View lab files on GitHub

  3. Lab 9.3 — Incident Triage from Telemetry

    Inject a partial failure; use only the dashboards and traces to identify root cause.

    60 minutes - Intermediate

    • Inject a 30% error rate at one downstream service
    • Use only Grafana + Jaeger to identify which service
    • Identify which user-facing endpoint is most affected
    • Document the runbook

    View lab files on GitHub

Key Takeaways

  • Observability has three pillars (metrics, logs, traces) plus a fourth (profiles); use them together
  • OpenTelemetry is the standard; one SDK, many backends
  • Distributed tracing answers latency questions metrics cannot; instrument every service
  • The four golden signals (latency, traffic, errors, saturation) are the minimum metric set per service
  • SLOs and error budgets convert observability into engineering discipline