Observability Stack Guide

Logs tell you what happened. Metrics tell you how much. Traces tell you where time was spent. OpenTelemetry unifies all three into one instrumentation layer. Learn to build a complete observability stack from scratch.

Your service is slow. Is it the database? The cache? A downstream API? Without observability, you are guessing. With it, you can trace a single request from the browser through every service, see exactly where the 2 seconds were spent, and correlate it with system metrics and error logs.

This guide builds a complete observability stack with OpenTelemetry (the CNCF standard), covering all three pillars: logs, metrics, and distributed traces.

The Three Pillars

Pillar	What It Answers	Example	Tool
Logs	What happened?	"User 123 failed login: invalid password"	Loki, Elasticsearch
Metrics	How much/how many?	Request rate: 500 req/s, P99 latency: 230ms	Prometheus, Grafana
Traces	Where was time spent?	API: 50ms → DB: 120ms → Cache: 5ms	Jaeger, Tempo

The power comes from correlating all three. A trace shows a slow request. You check metrics to see if latency spiked system-wide. You search logs filtered by that trace ID to find the error message. OpenTelemetry makes this correlation practical by attaching the same trace context and resource attributes to every signal.

OpenTelemetry: One SDK for Everything

Before OpenTelemetry, you needed separate libraries for each concern: StatsD for metrics, a Zipkin client for traces, and a structured-logging library for logs. OpenTelemetry provides a single, vendor-neutral instrumentation SDK that can export to many different backends.

# Install OpenTelemetry for Python
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp \
    opentelemetry-instrumentation-flask \
    opentelemetry-instrumentation-requests \
    opentelemetry-instrumentation-sqlalchemy

Setting Up Traces

# tracing.py - Initialize OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

def init_tracing(service_name: str):
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)

    # Export traces to the collector via gRPC
    exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
    provider.add_span_processor(BatchSpanProcessor(exporter))

    trace.set_tracer_provider(provider)

# Usage in your app
init_tracing("order-service")
tracer = trace.get_tracer("order-service")

# Create custom spans
def process_order(order_id: int):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_inventory"):
            check_inventory(order_id)

        with tracer.start_as_current_span("charge_payment"):
            charge_customer(order_id)

        with tracer.start_as_current_span("send_notification"):
            notify_customer(order_id)

# Result in Jaeger:
# process_order (350ms)
#   ├── validate_inventory (45ms)
#   ├── charge_payment (280ms)    ← The bottleneck!
#   └── send_notification (25ms)

Auto-Instrumentation

OpenTelemetry can automatically instrument popular libraries with a few lines of setup and no changes to your request handlers:

# Auto-instrument Flask, requests, and SQLAlchemy
# (other libraries such as Redis have their own instrumentation packages)
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

app = Flask(__name__)

FlaskInstrumentor().instrument_app(app)     # Traces every HTTP request
RequestsInstrumentor().instrument()          # Traces outgoing HTTP calls
SQLAlchemyInstrumentor().instrument()        # Traces every SQL query

# Now every request automatically generates:
# - A root span for the Flask route
# - Child spans for each outgoing HTTP request
# - Child spans for each SQL query with the actual SQL text
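Conceptually, an instrumentor wraps a library's entry points once at startup so that every later call is measured. The sketch below illustrates that idea with a plain monkey-patch. It is a simplification under stated assumptions, not how any particular OpenTelemetry instrumentor is implemented: `instrument`, `FakeDB`, and `calls` are hypothetical names.

```python
import functools
import time

calls = []  # (method name, duration in ms), filled in by the wrapper

def instrument(cls, method_name):
    """Replace cls.method_name with a wrapper that times every call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            calls.append((method_name, elapsed_ms))

    setattr(cls, method_name, wrapper)

class FakeDB:
    """Hypothetical stand-in for a database client."""
    def query(self, sql):
        return [("row", sql)]

instrument(FakeDB, "query")          # patch once, at startup
rows = FakeDB().query("SELECT 1")    # calling code is unchanged
print(rows, calls[0][0])
```

Because the patch is applied to the class, code that already uses `FakeDB` picks up the timing without being edited, which is the property that makes auto-instrumentation cheap to adopt.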

Setting Up Metrics

import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

def init_metrics(service_name: str):
    exporter = OTLPMetricExporter(endpoint="http://otel-collector:4317")
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=10000)

    provider = MeterProvider(
        resource=Resource.create({"service.name": service_name}),
        metric_readers=[reader],
    )
    metrics.set_meter_provider(provider)

# Initialize once at startup, then create custom metrics
init_metrics("order-service")
meter = metrics.get_meter("order-service")

# Counter: total count of events
order_counter = meter.create_counter(
    name="orders.created",
    description="Total orders created",
    unit="1",
)

# Histogram: distribution of values (latency, sizes)
order_latency = meter.create_histogram(
    name="orders.processing_time",
    description="Order processing time",
    unit="ms",
)

# Up/Down Counter: current value that goes up and down
active_orders = meter.create_up_down_counter(
    name="orders.active",
    description="Currently processing orders",
)

# Record metrics
def create_order(order):
    start = time.time()
    active_orders.add(1)

    try:
        process(order)
        order_counter.add(1, {"status": "success", "region": "us-east"})
    except Exception:
        order_counter.add(1, {"status": "failed", "region": "us-east"})
        raise
    finally:
        duration = (time.time() - start) * 1000
        order_latency.record(duration)
        active_orders.add(-1)
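The attribute dictionaries passed to `add()` matter: each distinct combination of attribute values is aggregated as its own series. The toy model below sketches that behavior with the standard library only. `ToyCounter` is a hypothetical class for illustration, not part of the OpenTelemetry API.

```python
from collections import defaultdict

class ToyCounter:
    """Toy model of a counter: one running total per distinct attribute set."""

    def __init__(self):
        self.series = defaultdict(int)

    def add(self, amount, attributes=None):
        # Sort the items so that key order in the dict does not matter
        key = tuple(sorted((attributes or {}).items()))
        self.series[key] += amount

orders = ToyCounter()
orders.add(1, {"status": "success", "region": "us-east"})
orders.add(1, {"region": "us-east", "status": "success"})  # same set, same series
orders.add(1, {"status": "failed", "region": "us-east"})

for key, total in orders.series.items():
    print(dict(key), total)
```

Two attribute sets produce two series here. Keeping attributes to a small set of bounded values (status, region) keeps the number of series manageable.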

Structured Logging with Trace Correlation

import logging
import json
from opentelemetry import trace

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "module": record.module,
        }

        # Inject trace context whenever a valid span is active
        ctx = trace.get_current_span().get_span_context()
        if ctx.is_valid:
            log_data["trace_id"] = format(ctx.trace_id, '032x')
            log_data["span_id"] = format(ctx.span_id, '016x')

        return json.dumps(log_data)

# Setup
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("order-service")
logger.setLevel(logging.INFO)  # otherwise only WARNING and above are emitted
logger.addHandler(handler)

# Now every log entry includes trace_id and span_id:
# {"timestamp": "2026-04-28 10:30:00", "level": "ERROR",
#  "message": "Payment failed for order 456",
#  "trace_id": "abc123def456...", "span_id": "789abc..."}

# In Grafana: with a derived field configured on the Loki data source,
# click trace_id in a log entry to jump to the matching trace in Jaeger/Tempo
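Once every log line carries a `trace_id`, pulling all the logs for one slow request is a simple filter. The sketch below shows the idea on a few made-up JSON lines. The sample data and the `logs_for_trace` helper are hypothetical, and in practice the same filter would be a query in your log backend.

```python
import json

# Made-up log lines in the JSON shape produced by the formatter above
log_lines = [
    '{"level": "INFO", "message": "Order received", "trace_id": "aaa111"}',
    '{"level": "INFO", "message": "Health check"}',
    '{"level": "ERROR", "message": "Payment failed for order 456", "trace_id": "aaa111"}',
    '{"level": "INFO", "message": "Order received", "trace_id": "bbb222"}',
    'not json at all',
]

def logs_for_trace(lines, trace_id):
    """Return the parsed log entries that belong to the given trace."""
    matches = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured logs
        if isinstance(entry, dict) and entry.get("trace_id") == trace_id:
            matches.append(entry)
    return matches

for entry in logs_for_trace(log_lines, "aaa111"):
    print(entry["level"], entry["message"])
```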

The Collector: Central Pipeline

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

  # Add service metadata
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

exporters:
  # Send traces to Jaeger
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Send metrics to Prometheus
  prometheus:
    endpoint: 0.0.0.0:8889

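  # Send logs to Loki
  # (check your collector-contrib version: newer releases deprecate this
  # exporter in favor of sending OTLP to Loki directly)
  loki:
    endpoint: http://loki:3100/loki/api/v1/push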

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
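    logs:
      # Receives only logs sent over OTLP; JSON logs written to stdout
      # need a log shipper or an OTLP log handler to reach this pipeline
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]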
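The `batch` processor settings above (`timeout: 5s`, `send_batch_size: 1000`) describe a simple rule: send a batch when it is full, or when the timeout has passed, whichever comes first. The sketch below models that rule in plain Python. It is a simplified illustration, not the collector's implementation: `ToyBatcher`, `tick`, and the injectable clock are assumptions made to keep the example deterministic.

```python
import time

class ToyBatcher:
    """Flush when the batch is full or when the oldest item is too old."""

    def __init__(self, send, max_size=1000, timeout_s=5.0, clock=time.monotonic):
        self._send = send
        self._max_size = max_size
        self._timeout_s = timeout_s
        self._clock = clock
        self._items = []
        self._first_at = None  # arrival time of the oldest pending item

    def add(self, item):
        if not self._items:
            self._first_at = self._clock()
        self._items.append(item)
        if len(self._items) >= self._max_size:
            self.flush()

    def tick(self):
        """Call periodically; flushes a partial batch after the timeout."""
        if self._items and self._clock() - self._first_at >= self._timeout_s:
            self.flush()

    def flush(self):
        if self._items:
            self._send(self._items)
            self._items = []
            self._first_at = None

# Usage with a fake clock so the timeout can be simulated instantly
now = [0.0]
sent = []
batcher = ToyBatcher(sent.append, max_size=3, timeout_s=5.0, clock=lambda: now[0])
for i in range(4):
    batcher.add(i)     # the first three items fill a batch
now[0] = 6.0           # pretend six seconds have passed
batcher.tick()         # the leftover item goes out on timeout
print(sent)
```

Larger batches mean fewer export calls, while the timeout bounds how stale the data can get when traffic is low.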

Grafana Dashboards

# Docker Compose for the full stack
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    volumes:
      # Mount the collector config shown above at the contrib image's default path
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # gRPC receiver
      - "4318:4318"   # HTTP receiver
      - "8889:8889"   # Prometheus metrics

  jaeger:
    image: jaegertracing/all-in-one
    ports:
      - "16686:16686"  # Jaeger UI

  prometheus:
    image: prom/prometheus
    volumes:
      # prometheus.yml needs a scrape job targeting otel-collector:8889
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true   # local development only

  loki:
    image: grafana/loki
    ports:
      - "3100:3100"

Key Takeaways

Observability is not monitoring: monitoring tells you something is wrong, observability helps you find out why
OpenTelemetry is the standard: vendor-neutral, CNCF-backed, and supported by most observability backends
Auto-instrumentation covers the most common needs: Flask, Django, Express, database drivers, and HTTP clients all have plugins
Correlate all three pillars: trace IDs in logs let you jump from a log entry to the full request trace
Use the Collector as a central pipeline: receive from all services, process in one place, export to any supported backend
Histograms over averages: P99 latency reveals problems that averages hide
Start with auto-instrumentation and traces: add custom metrics and structured logging as you identify specific needs
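The point about histograms and averages is easy to check with a small calculation. In the made-up sample below, 98 requests take 20 ms and 2 take 2 seconds. The numbers are invented for illustration, and the percentile is computed with the standard library rather than by a metrics backend.

```python
import statistics

# Made-up latencies: 98 fast requests and 2 very slow ones
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
# quantiles(n=100) returns the 99 cut points; index 98 is the 99th percentile
p99_ms = statistics.quantiles(latencies_ms, n=100)[98]

print(f"mean={mean_ms:.1f}ms median={median_ms}ms p99={p99_ms:.0f}ms")
```

The mean lands just under 60 ms and looks healthy, while the P99 shows that some users wait two seconds.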

The goal of observability is answering questions you did not know you would ask. With traces, metrics, and correlated logs, you can debug most production issues by following the data instead of guessing. Start with OpenTelemetry auto-instrumentation, then add custom spans and metrics where the defaults leave gaps.
