Logs, metrics, traces, and alerts for production AI agents — visibility into what agents do, how they perform, and when they go wrong.
Traditional monitoring answers "is it broken?" Observability answers "what happened?" An agent that fails silently is worse than one that fails loudly. You need visibility into three dimensions: Logs (what happened, when, with what inputs), Metrics (how many, how fast, how much), Traces (the full request journey through your system).
The critical difference: Monitoring tells you the agent is down. Observability tells you why—and prevents it from happening again.
logs metrics traces visibility
Most teams log randomly. Effective logging is structured: every agent action emits a JSON object containing the same fields. Required: timestamp (when), agent_id (which), session_id (context), step (which iteration), action (tool call or decision), input (what we passed), output (what we got), latency (how long), cost (tokens/API calls).
Unstructured logs: "Agent failed." Structured logs: you can query, aggregate, alert, and replay any agent session.
JSON logging structured fields immutable trace
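The field list above can be made concrete as a single helper that every agent action goes through. A minimal sketch — the function name and field names are illustrative, not a standard schema:

```python
import json
import time

def log_step(agent_id, session_id, step, action, input_data, output_data,
             latency_ms, cost_usd):
    """Emit one structured JSON log line per agent action.
    Every action uses the same fields, so sessions can be
    queried, aggregated, alerted on, and replayed."""
    record = {
        "timestamp": time.time(),
        "agent_id": agent_id,
        "session_id": session_id,
        "step": step,
        "action": action,
        "input": input_data,
        "output": output_data,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }
    print(json.dumps(record))  # in production, ship to a log pipeline instead
    return record
```

One line per action, one schema for all agents — that is what makes "replay any agent session" possible.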
Latency metrics — response time, time per step, end-to-end duration (alerts on slowdown). Throughput metrics — requests per second, tasks completed, parallelism (capacity planning). Cost metrics — tokens per request, API call costs, cost per task (prevents runaway bills). Quality metrics — success rate, error rate, eval scores, policy violation rate (indicates degradation).
Track all four. Teams that only track latency miss cost explosions. Teams that only track success rate miss slow regressions.
latency throughput cost quality
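All four metric families can live behind one recording call. A toy in-process sketch — real systems would export to Prometheus or StatsD, and the p99 estimate here is a crude sorted-list lookup:

```python
class AgentMetrics:
    """Track latency, throughput, cost, and quality together,
    so no family is silently missing."""
    def __init__(self):
        self.latencies = []   # latency
        self.completed = 0    # throughput
        self.tokens = 0       # cost
        self.successes = 0    # quality
        self.failures = 0

    def record(self, latency_ms, tokens, ok):
        self.latencies.append(latency_ms)
        self.completed += 1
        self.tokens += tokens
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    def snapshot(self):
        n = len(self.latencies)
        return {
            "p99_latency_ms": sorted(self.latencies)[min(n - 1, int(0.99 * n))] if n else None,
            "tasks_completed": self.completed,
            "total_tokens": self.tokens,
            "success_rate": self.successes / n if n else None,
        }
```

A snapshot that reports all four at once is exactly what keeps a cost explosion or a slow regression from hiding behind a healthy latency graph.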
Your dashboard should answer: Is the agent healthy right now? Top-level cards: success rate (%), p99 latency (ms), error count, cost per task ($). Second level: breakdown by task type, agent version, error type. Third level: drill-down into individual sessions for debugging.
Never ship a dashboard that requires clicking through 5 screens to find the answer. Put the answer on the front screen.
SLO tracking drill-down health status
Token costs are the most dangerous metric for agents: small bugs can lead to $1000s in overage within hours. Monitor: tokens per step (should be stable), total tokens per session (alert if it exceeds baseline by more than N%), and cost per session ($). Set hard limits: if cost per session > $X, fail fast and alert on-call.
A runaway agent that costs $100/session can rack up $50k in a day if it hits production undetected.
cost limits token budgets runaway detection
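The hard-limit rule can be enforced in code rather than policy. A sketch — the class name, exception, and the $5 default are placeholders for your own budget:

```python
class SessionBudgetExceeded(Exception):
    """Raised when a session's spend crosses its hard limit."""


class CostGuard:
    """Hard per-session cost ceiling. Fail fast instead of
    letting a runaway agent keep burning tokens."""
    def __init__(self, max_session_cost_usd=5.00):
        self.max = max_session_cost_usd
        self.spent = 0.0

    def charge(self, cost_usd):
        self.spent += cost_usd
        if self.spent > self.max:
            # stop the session and page on-call here
            raise SessionBudgetExceeded(
                f"session cost ${self.spent:.2f} exceeds limit ${self.max:.2f}")
```

Call `charge()` after every model or tool invocation; the exception is the fail-fast path, and the handler is where the on-call alert fires.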
Alert fatigue kills observability. Only alert on: (1) Hard SLO breaches — p99 latency > 5s, error rate > 2%, (2) Cost anomalies — cost per session 10x baseline, (3) Availability — no successful requests in 5 minutes, (4) Behavioral change — error type distribution shifted. Do not alert on: "request took 1.5s" (noise).
SLO alerts cost alerts actionable
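The four actionable conditions translate directly into an alert predicate. A sketch using the illustrative thresholds from the text — tune them to your own SLOs:

```python
def should_alert(p99_ms, error_rate, session_cost, baseline_cost,
                 successes_last_5m):
    """Return alert reasons only for actionable conditions.
    A single slow request never produces an alert."""
    reasons = []
    if p99_ms > 5000:                                   # hard SLO breach
        reasons.append("SLO breach: p99 latency > 5s")
    if error_rate > 0.02:                               # hard SLO breach
        reasons.append("SLO breach: error rate > 2%")
    if baseline_cost and session_cost > 10 * baseline_cost:
        reasons.append("cost anomaly: 10x baseline")    # cost anomaly
    if successes_last_5m == 0:                          # availability
        reasons.append("availability: no successes in 5 minutes")
    return reasons
```

Note what is absent: no per-request latency alert, because "request took 1.5s" is exactly the noise that causes alert fatigue.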
A single request to your system may touch 5+ agents, each calling tools, waiting for responses, making decisions. A trace is the full ancestry tree: root request → orchestrator agent → researcher agent → writer agent → reviewer agent → result. Each node is a span (unit of work). The trace context (trace ID) ties all spans together.
trace ID span hierarchy context propagation
A good tracing system lets you: (1) Reproduce the exact execution (all inputs/outputs), (2) See exactly where latency was spent (which agent was slow?), (3) Identify which tool call failed, (4) Trace costs to specific agents/tools. Without this, you're debugging blind.
Trace-based debugging is so powerful that it's worth the infrastructure investment even for small systems.
replay latency attribution cost attribution
Spans form a tree. Root span wraps the entire request. Child spans are agent steps. Grandchild spans are tool calls. This hierarchy lets you ask: "Why was my orchestrator slow?" Answer: look at its child spans. "Why was tool X slow?" Look at its duration. "Which tool called the payment API?" Trace the hierarchy.
parent-child hierarchy causality
Some failures are obvious (error rate spikes). Others are subtle (agent using 2x tokens, taking 5x steps, calling unfamiliar tools). Set baselines: your agent's normal token usage is 200/request. When it hits 400+, that's a signal. Normal step count is 4. When it hits 10, something changed. These anomalies often precede outages.
baseline drift statistical detection early warning
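Baseline comparison is a one-function check. A sketch — the baselines and multipliers mirror the numbers in the text and should come from your own historical data:

```python
def anomaly_signals(tokens_used, steps_taken,
                    baseline_tokens=200, baseline_steps=4,
                    token_factor=2.0, step_factor=2.5):
    """Flag sessions that drift far from normal before they
    become outages. No errors required to trigger a signal."""
    signals = []
    if tokens_used >= token_factor * baseline_tokens:
        signals.append(f"token usage {tokens_used} vs baseline {baseline_tokens}")
    if steps_taken >= step_factor * baseline_steps:
        signals.append(f"step count {steps_taken} vs baseline {baseline_steps}")
    return signals
```

A session at 400 tokens and 10 steps trips both signals even though nothing has errored yet — exactly the subtle failure mode described above.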
A circuit breaker is a pattern: if error rate exceeds X% for Y seconds, automatically stop accepting requests and return a fallback response (cached result, simpler agent, human escalation). Don't wait for a human to notice and manually kill the agent. Let the system self-heal.
Three states: Closed (healthy, requests flow through), Open (too many errors, reject requests), Half-Open (recovering, allow probe requests to test health).
circuit breaker auto-failover self-healing
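The three states map to a small state machine. A sketch with illustrative thresholds (error-rate window, cooldown) — production breakers would also track timeouts and use a sliding time window:

```python
import time

class CircuitBreaker:
    """Closed -> Open when the recent error rate breaches the limit;
    Open -> Half-Open after a cooldown; Half-Open -> Closed on a
    successful probe, back to Open on a failed one."""
    def __init__(self, max_error_rate=0.5, window=10, cooldown_s=30.0):
        self.max_error_rate = max_error_rate
        self.window = window          # number of recent calls considered
        self.cooldown_s = cooldown_s
        self.results = []             # True = success, False = error
        self.state = "closed"
        self.opened_at = None

    def allow(self):
        if self.state == "open":
            if time.time() - self.opened_at >= self.cooldown_s:
                self.state = "half-open"   # let one probe request through
                return True
            return False                   # reject: serve the fallback instead
        return True

    def record(self, ok):
        self.results = (self.results + [ok])[-self.window:]
        if self.state == "half-open":
            if ok:
                self.state, self.results = "closed", []
            else:
                self.state, self.opened_at = "open", time.time()
            return
        errors = self.results.count(False)
        if len(self.results) >= self.window and errors / len(self.results) > self.max_error_rate:
            self.state, self.opened_at = "open", time.time()
```

Callers check `allow()` before invoking the agent and serve the fallback (cached result, simpler agent, human escalation) when it returns False — no human in the loop to notice and kill the agent.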
Observability is only useful if someone acts on the signals. Define: who gets paged? When? What's the runbook? If your only on-call runbook is "restart the agent," you have a systemic problem. The runbook should be "check the trace, identify the issue, consider: upgrade model? Add guardrail? Rollback? Escalate?"
on-call rotation runbook game day
A prototype logs to stdout. Production logs to a centralized system (CloudWatch, Datadog, New Relic). Metrics go to a time-series DB (Prometheus, InfluxDB). Traces go to a specialized backend (Jaeger, Honeycomb). Cost: cheap. Value: priceless when a 3am incident hits.
The time to set up observability is before you need it.
centralized logging time-series metrics distributed tracing
Observability isn't just for debugging — it's required for compliance. GDPR requires you to explain what happened to a user's data. SOC2 requires audit trails. HIPAA requires immutable logs. These requirements force you to have observability anyway. Make it work for both debugging and compliance.
audit trail GDPR compliance
① Log all side effects — agent actions that change state must be logged.
② Trace context everywhere — propagate trace ID through all requests.
③ Structured, immutable logs — JSON, write-once, searchable.
④ Track all four metric types — latency, throughput, cost, quality.
⑤ Set SLOs before launch — know your thresholds in advance.
⑥ Circuit breakers for safety — auto-failover on degradation.
⑦ Alert on anomaly, not threshold — detect drift, not just absolutes.
⑧ Use traces for debugging — never guess, always replay.
⑨ On-call training required — observability only works if people use it.
⑩ Ship with observability — don't retrofit after production breaks.