Observability Stack: Logs, Metrics, and Traces with OpenTelemetry

Logs tell you what happened. Metrics tell you how much. Traces tell you where time was spent. OpenTelemetry unifies all three into one instrumentation layer. Learn to build a complete observability stack from scratch.


Your service is slow. Is it the database? The cache? A downstream API? Without observability, you are guessing. With it, you can trace a single request from the browser through every service, see exactly where the 2 seconds were spent, and correlate it with system metrics and error logs.

This guide builds a complete observability stack with OpenTelemetry (the CNCF standard), covering all three pillars: logs, metrics, and distributed traces.

The Three Pillars

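| Pillar  | What It Answers       | Example                                       | Tools               |
|---------|-----------------------|-----------------------------------------------|---------------------|
| Logs    | What happened?        | "User 123 failed login: invalid password"     | Loki, Elasticsearch |
| Metrics | How much/how many?    | Request rate: 500 req/s, P99 latency: 230ms   | Prometheus, Grafana |
| Traces  | Where was time spent? | API: 50ms → DB: 120ms → Cache: 5ms            | Jaeger, Tempo       |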

The power comes from correlating all three. A trace shows a slow request. You check metrics to see if latency spiked system-wide. You search logs filtered by that trace ID to find the error message. OpenTelemetry makes this correlation automatic.

OpenTelemetry: One SDK for Everything

Before OpenTelemetry, you needed separate libraries for each concern: StatsD for metrics, Zipkin client for traces, structured logging for logs. OpenTelemetry provides a single, vendor-neutral instrumentation SDK that exports to any backend.

# Install OpenTelemetry for Python
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp \
    opentelemetry-instrumentation-flask \
    opentelemetry-instrumentation-requests \
    opentelemetry-instrumentation-sqlalchemy

Setting Up Traces

# tracing.py - Initialize OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

def init_tracing(service_name: str):
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)

    # Export traces to the collector via gRPC
    exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
    provider.add_span_processor(BatchSpanProcessor(exporter))

    trace.set_tracer_provider(provider)

# Usage in your app
init_tracing("order-service")
tracer = trace.get_tracer("order-service")

# Create custom spans
def process_order(order_id: int):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_inventory"):
            check_inventory(order_id)

        with tracer.start_as_current_span("charge_payment"):
            charge_customer(order_id)

        with tracer.start_as_current_span("send_notification"):
            notify_customer(order_id)

# Result in Jaeger:
# process_order (350ms)
#   ├── validate_inventory (45ms)
#   ├── charge_payment (280ms)    ← The bottleneck!
#   └── send_notification (25ms)
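
Reading the span tree by eye works for four spans, but the same comparison can be done in code once a trace has dozens. The following is a minimal sketch, not an OpenTelemetry API: `SpanRecord` and `slowest_child` are hypothetical names, the durations are copied from the Jaeger view above, and the start offsets assume the three child spans ran one after another.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanRecord:
    """Hypothetical, simplified span: real spans also carry IDs, attributes, and status."""
    name: str
    parent: Optional[str]
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def slowest_child(spans: list[SpanRecord], parent_name: str) -> SpanRecord:
    """Return the direct child of `parent_name` with the longest duration."""
    children = [s for s in spans if s.parent == parent_name]
    if not children:
        raise ValueError(f"no child spans under {parent_name!r}")
    return max(children, key=lambda s: s.duration_ms)


# Durations from the Jaeger view above; start offsets are assumed to be sequential.
spans = [
    SpanRecord("process_order", None, 0, 350),
    SpanRecord("validate_inventory", "process_order", 0, 45),
    SpanRecord("charge_payment", "process_order", 45, 325),
    SpanRecord("send_notification", "process_order", 325, 350),
]

bottleneck = slowest_child(spans, "process_order")
share = bottleneck.duration_ms / spans[0].duration_ms
print(f"{bottleneck.name}: {bottleneck.duration_ms:.0f}ms ({share:.0%} of the request)")
# prints: charge_payment: 280ms (80% of the request)
```

In practice the tracing backend does this ranking for you; the sketch only shows what "the bottleneck" means in terms of span data.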

Auto-Instrumentation

OpenTelemetry can automatically instrument popular libraries. You add a few lines of setup at startup and leave your route handlers and queries untouched:

# Auto-instrument Flask, requests, SQLAlchemy, Redis, etc.
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

app = Flask(__name__)

FlaskInstrumentor().instrument_app(app)     # Traces every HTTP request
RequestsInstrumentor().instrument()          # Traces outgoing HTTP calls
SQLAlchemyInstrumentor().instrument()        # Traces every SQL query

# Now every request automatically generates:
# - A root span for the Flask route
# - Child spans for each outgoing HTTP request
# - Child spans for each SQL query with the actual SQL text
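
Child spans for outgoing HTTP calls only connect to the next service's spans if the trace context travels with the request. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header for this. The sketch below shows the header format using only the standard library; `build_traceparent` and `parse_traceparent` are hypothetical helper names written for illustration, and in a real service the instrumentation libraries above inject and extract the header for you.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Format a traceparent header value for an outgoing request."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(header: str) -> dict:
    """Parse a traceparent header value; raise ValueError if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    if int(trace_id, 16) == 0 or int(span_id, 16) == 0:
        raise ValueError("trace_id and span_id must be non-zero")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Caller side: random 128-bit trace ID and 64-bit span ID (the SDK generates these).
trace_id = secrets.randbits(128) or 1
span_id = secrets.randbits(64) or 1
header = build_traceparent(trace_id, span_id)

# Callee side: the server span it creates joins the same trace.
ctx = parse_traceparent(header)
print(header, ctx["sampled"])
```

The 32- and 16-character hex widths are the same ones used when trace and span IDs are written into log lines, which is what makes a trace ID searchable across services.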

Setting Up Metrics

import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

def init_metrics(service_name: str):
    exporter = OTLPMetricExporter(endpoint="http://otel-collector:4317")
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=10000)

    provider = MeterProvider(
        resource=Resource.create({"service.name": service_name}),
        metric_readers=[reader],
    )
    metrics.set_meter_provider(provider)

# Create custom metrics
meter = metrics.get_meter("order-service")

# Counter: total count of events
order_counter = meter.create_counter(
    name="orders.created",
    description="Total orders created",
    unit="1",
)

# Histogram: distribution of values (latency, sizes)
order_latency = meter.create_histogram(
    name="orders.processing_time",
    description="Order processing time",
    unit="ms",
)

# Up/Down Counter: current value that goes up and down
active_orders = meter.create_up_down_counter(
    name="orders.active",
    description="Currently processing orders",
)

# Record metrics
def create_order(order):
    # perf_counter is monotonic, so the measured duration cannot go
    # negative if the wall clock is adjusted mid-request
    start = time.perf_counter()
    active_orders.add(1)

    try:
        process(order)
        order_counter.add(1, {"status": "success", "region": "us-east"})
    except Exception:
        order_counter.add(1, {"status": "failed", "region": "us-east"})
        raise
    finally:
        duration = (time.perf_counter() - start) * 1000  # seconds -> ms
        order_latency.record(duration)
        active_orders.add(-1)
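
Recording latency as a histogram rather than a single running average matters because an average can look healthy while a slice of users waits seconds. The sketch below illustrates the effect with made-up numbers and a hand-written nearest-rank `percentile` helper (a hypothetical function, not part of OpenTelemetry). It computes exact percentiles from raw samples, whereas a metrics backend estimates them from the histogram's buckets.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample: 95 fast requests and 5 that hit a slow payment call.
latencies_ms = [20.0] * 95 + [2000.0] * 5

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"mean={avg:.0f}ms p50={p50:.0f}ms p99={p99:.0f}ms")
# prints: mean=119ms p50=20ms p99=2000ms
```

A mean of 119ms suggests nothing is wrong, while the P99 shows that some requests take two full seconds.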

Structured Logging with Trace Correlation

import logging
import json
from opentelemetry import trace

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "module": record.module,
        }

        # Inject trace context whenever a valid span is active.
        # Checking is_valid (rather than is_recording) keeps the IDs
        # on spans that exist but were not sampled.
        ctx = trace.get_current_span().get_span_context()
        if ctx.is_valid:
            log_data["trace_id"] = format(ctx.trace_id, '032x')
            log_data["span_id"] = format(ctx.span_id, '016x')

        return json.dumps(log_data)

# Setup
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
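logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # the default effective level (WARNING) would drop INFO logs

# Log entries emitted inside an active span now include trace_id and span_id
# (output abbreviated):
# {"timestamp": "2026-04-28 10:30:00,123", "level": "ERROR",
#  "message": "Payment failed for order 456",
#  "trace_id": "abc123def456...", "span_id": "789abc..."}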

# In Grafana: click trace_id in a log entry to jump directly
# to the corresponding trace in Jaeger/Tempo
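
Without a log backend at hand, the same correlation can be done on raw JSON lines. This is a minimal sketch of the idea behind "filter logs by trace ID": `logs_for_trace` is a hypothetical helper, and the log lines and shortened IDs are invented for illustration (real trace IDs are 32 hex characters).

```python
import json


def logs_for_trace(lines: list[str], trace_id: str) -> list[dict]:
    """Return parsed log entries whose trace_id matches, skipping non-JSON lines."""
    matches = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a plain-text line mixed into the stream
        if isinstance(entry, dict) and entry.get("trace_id") == trace_id:
            matches.append(entry)
    return matches


# Hypothetical log stream from two concurrent requests.
raw_logs = [
    '{"level": "INFO", "message": "Order 456 received", "trace_id": "aaaa", "span_id": "01"}',
    '{"level": "INFO", "message": "Order 457 received", "trace_id": "bbbb", "span_id": "02"}',
    'plain text line without trace context',
    '{"level": "ERROR", "message": "Payment failed for order 456", "trace_id": "aaaa", "span_id": "03"}',
]

related = logs_for_trace(raw_logs, "aaaa")
for entry in related:
    print(entry["level"], entry["message"])
```

Loki and Elasticsearch perform the equivalent filter as an indexed query; the point is that one shared field is enough to separate interleaved requests.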

The Collector: Central Pipeline

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

  # Add service metadata
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

exporters:
  # Send traces to Jaeger
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Send metrics to Prometheus
  prometheus:
    endpoint: 0.0.0.0:8889

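  # Send logs to Loki
  # Note: newer collector-contrib releases deprecate this exporter. Loki 3.x
  # can ingest OTLP directly, using an otlphttp exporter pointed at
  # http://loki:3100/otlp. Check the release notes for the image you run.
  loki:
    endpoint: http://loki:3100/loki/api/v1/push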

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]  # enrich first, batch last
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
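
The `batch` processor settings above (`timeout: 5s`, `send_batch_size: 1000`) mean "export when 1000 items are buffered or when 5 seconds have passed, whichever comes first". The sketch below is a simplified Python model of that rule, not the Collector's actual implementation (which is written in Go and has more options). `BatchBuffer` and its methods are hypothetical names, and the timeout here is measured from the first buffered item as a simplifying assumption.

```python
import time
from typing import Callable


class BatchBuffer:
    """Simplified model of size-or-timeout batching."""

    def __init__(self, export: Callable[[list], None], max_size: int = 1000,
                 timeout_s: float = 5.0, clock: Callable[[], float] = time.monotonic):
        self._export = export
        self._max_size = max_size
        self._timeout_s = timeout_s
        self._clock = clock
        self._items: list = []
        self._first_item_at = 0.0

    def add(self, item) -> None:
        """Buffer one item and flush immediately if the batch is full."""
        if not self._items:
            self._first_item_at = self._clock()
        self._items.append(item)
        if len(self._items) >= self._max_size:
            self.flush()

    def tick(self) -> None:
        """Call periodically: flush a partial batch once the timeout has elapsed."""
        if self._items and self._clock() - self._first_item_at >= self._timeout_s:
            self.flush()

    def flush(self) -> None:
        """Export everything buffered so far as one batch."""
        if self._items:
            batch, self._items = self._items, []
            self._export(batch)


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
sent: list[list] = []
buffer = BatchBuffer(sent.append, max_size=3, timeout_s=5.0, clock=lambda: now[0])

for span in ["a", "b", "c", "d"]:
    buffer.add(span)   # "a", "b", "c" fill a batch and are exported at once
now[0] = 6.0
buffer.tick()          # "d" is exported because more than 5s have passed
print(sent)            # prints: [['a', 'b', 'c'], ['d']]
```

Batching trades a few seconds of delay for far fewer network calls to the backends, which is why it usually sits last in each pipeline.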

Grafana Dashboards

# Docker Compose for the full stack
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    volumes:
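      # Mount the collector config from the previous section at the
      # path the contrib image loads by default
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml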
    ports:
      - "4317:4317"   # gRPC receiver
      - "4318:4318"   # HTTP receiver
      - "8889:8889"   # Prometheus metrics

  jaeger:
    image: jaegertracing/all-in-one
    ports:
      - "16686:16686"  # Jaeger UI

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true

  loki:
    image: grafana/loki
    ports:
      - "3100:3100"

Key Takeaways

  • Observability is not monitoring: monitoring tells you something is wrong, observability helps you find out why
  • OpenTelemetry is the standard: vendor-neutral, CNCF-backed, and supported by most major observability backends
  • Auto-instrumentation covers the most common needs: Flask, Django, Express, database drivers, and HTTP clients all have instrumentation libraries
  • Correlate all three pillars: trace IDs in logs let you jump from a log entry to the full request trace
  • Use the Collector as a central pipeline: receive from all services, process in one place, export to any backend
  • Histograms over averages: P99 latency reveals problems that averages hide
  • Start with auto-instrumentation and traces: add custom metrics and structured logging as you identify specific needs

The goal of observability is answering questions you did not know you would ask. With traces, metrics, and correlated logs, you can debug any production issue by following the data instead of guessing. Start with OpenTelemetry auto-instrumentation today — you will wonder how you ever debugged without it.
