Production AI without observability is like driving blindfolded. You need to see: how long each pipeline step takes, how many tokens each request consumes, how much each request costs, and whether quality is holding up over time.
LLM Tracing
Trace every RAG request through: embed query (5ms) → vector search (10ms) → context assembly (1ms) → LLM call (2000ms) → total (2016ms). This tells you WHERE time is spent and WHERE to optimize.
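As a minimal sketch of the per-step timing described above, a small context manager can record how long each stage takes. The `Trace` class, the `traced_rag` function, and the step names are hypothetical, and the `time.sleep` calls are placeholders standing in for the real embedding, search, and LLM calls. A production system would more likely emit these spans to a tracing backend.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects wall-clock durations (in milliseconds) for named pipeline steps."""

    def __init__(self) -> None:
        self.spans = {}  # step name -> duration in ms, in execution order

    @contextmanager
    def step(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the step raises.
            self.spans[name] = (time.perf_counter() - start) * 1000

    def total_ms(self) -> float:
        return sum(self.spans.values())

    def slowest(self) -> str:
        # The step with the largest duration is the first optimization target.
        return max(self.spans, key=self.spans.get)


def traced_rag(question: str) -> Trace:
    """Hypothetical pipeline: each sleep stands in for a real call."""
    trace = Trace()
    with trace.step("embed_query"):
        time.sleep(0.005)
    with trace.step("vector_search"):
        time.sleep(0.010)
    with trace.step("context_assembly"):
        time.sleep(0.001)
    with trace.step("llm_call"):
        time.sleep(0.050)
    return trace


if __name__ == "__main__":
    trace = traced_rag("What is observability?")
    for name, ms in trace.spans.items():
        print(f"{name}: {ms:.1f} ms")
    print(f"total: {trace.total_ms():.1f} ms, slowest: {trace.slowest()}")
```

Because the durations are stored per step, the same structure can be logged alongside the request ID, which makes it possible to compare slow and fast requests step by step.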
Token Monitoring
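# Track token usage and cost per request
from prometheus_client import Counter

# Prices in USD per 1,000 tokens; verify against current pricing for your model.
INPUT_USD_PER_1K = 0.003
OUTPUT_USD_PER_1K = 0.015
# Metric name is illustrative. `claude` is assumed to be an initialized API client.
prometheus_counter = Counter("rag_cost_usd_total", "Cumulative LLM spend in USD")

def monitored_rag(question: str) -> dict:
    response = claude.messages.create(...)
    usage = response.usage
    metrics = {
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "cost_usd": (usage.input_tokens * INPUT_USD_PER_1K + usage.output_tokens * OUTPUT_USD_PER_1K) / 1000,
        "model": "claude-sonnet-4-6",
    }
    prometheus_counter.inc(metrics["cost_usd"])  # accumulate total spend
    return {"answer": ..., "metrics": metrics}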
Cost Monitoring
LLM API costs are typically the largest recurring expense in a RAG system. Track cost per request, per tenant, and per day, and set budgets with alerts on each. A runaway agent or a suddenly popular query can blow through your monthly budget in hours.
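The per-tenant, per-day tracking and alerting described above could look roughly like the following. This is a sketch under stated assumptions: `BudgetTracker`, its status strings, and the 80% alert threshold are all hypothetical, and spend is kept in memory, whereas a real deployment would persist it and route alerts to a paging or chat system.

```python
from collections import defaultdict
from datetime import date
from typing import Optional


class BudgetTracker:
    """Accumulates spend per (tenant, day) and reports the budget status."""

    def __init__(self, daily_budget_usd: float, alert_fraction: float = 0.8) -> None:
        self.daily_budget_usd = daily_budget_usd
        self.alert_fraction = alert_fraction  # warn once this share is spent
        self.spend = defaultdict(float)       # (tenant, day) -> USD spent

    def record(self, tenant: str, cost_usd: float, day: Optional[date] = None) -> str:
        """Add one request's cost and return 'ok', 'alert', or 'over_budget'."""
        key = (tenant, day or date.today())
        self.spend[key] += cost_usd
        spent = self.spend[key]
        if spent >= self.daily_budget_usd:
            return "over_budget"
        if spent >= self.daily_budget_usd * self.alert_fraction:
            return "alert"
        return "ok"


if __name__ == "__main__":
    tracker = BudgetTracker(daily_budget_usd=10.0)
    today = date(2025, 1, 1)  # fixed date so the example is reproducible
    for cost in (5.0, 3.5, 2.0):
        print(cost, tracker.record("acme", cost, today))
```

Returning a status from `record` lets the caller decide what to do at each level, for example logging on `alert` and rejecting or queueing new requests on `over_budget`.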