Logs, metrics, traces, and alerts for production AI agents — visibility into what agents do, how they perform, and when they go wrong.
Traditional monitoring answers "is it broken?" Observability answers "what happened?" An agent that fails silently is worse than one that fails loudly. You need visibility into three dimensions: Logs (what happened, when, with what inputs), Metrics (how many, how fast, how much), Traces (the full request journey through your system).
The critical difference: Monitoring tells you the agent is down. Observability tells you why—and prevents it from happening again.
logs metrics traces visibility
Most teams log randomly. Effective logging is structured: every agent action emits a JSON object containing the same fields. Required: timestamp (when), agent_id (which), session_id (context), step (which iteration), action (tool call or decision), input (what we passed), output (what we got), latency (how long), cost (tokens/API calls).
Unstructured logs: "Agent failed." Structured logs: you can query, aggregate, alert, and replay any agent session.
JSON logging structured fields immutable trace
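The field list above can be made concrete as a single helper that every agent action goes through. A minimal sketch — the function name and field names are illustrative, not a standard schema:

```python
import json
import time

def log_step(agent_id, session_id, step, action, input_data, output_data,
             latency_ms, cost_usd):
    """Emit one structured JSON log line per agent action.
    Every action uses the same fields, so sessions can be
    queried, aggregated, alerted on, and replayed."""
    record = {
        "timestamp": time.time(),
        "agent_id": agent_id,
        "session_id": session_id,
        "step": step,
        "action": action,
        "input": input_data,
        "output": output_data,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }
    print(json.dumps(record))  # in production, ship to a log pipeline instead
    return record
```

One line per action, one schema for all agents — that is what makes "replay any agent session" possible.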
Latency metrics — response time, time per step, end-to-end duration (alerts on slowdown). Throughput metrics — requests per second, tasks completed, parallelism (capacity planning). Cost metrics — tokens per request, API call costs, cost per task (prevents runaway bills). Quality metrics — success rate, error rate, eval scores, policy violation rate (indicates degradation).
Track all four. Teams that only track latency miss cost explosions. Teams that only track success rate miss slow regressions.
latency throughput cost quality
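All four metric families can live behind one recording call. A toy in-process sketch — real systems would export to Prometheus or StatsD, and the p99 estimate here is a crude sorted-list lookup:

```python
class AgentMetrics:
    """Track latency, throughput, cost, and quality together,
    so no family is silently missing."""
    def __init__(self):
        self.latencies = []   # latency
        self.completed = 0    # throughput
        self.tokens = 0       # cost
        self.successes = 0    # quality
        self.failures = 0

    def record(self, latency_ms, tokens, ok):
        self.latencies.append(latency_ms)
        self.completed += 1
        self.tokens += tokens
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    def snapshot(self):
        n = len(self.latencies)
        return {
            "p99_latency_ms": sorted(self.latencies)[min(n - 1, int(0.99 * n))] if n else None,
            "tasks_completed": self.completed,
            "total_tokens": self.tokens,
            "success_rate": self.successes / n if n else None,
        }
```

A snapshot that reports all four at once is exactly what keeps a cost explosion or a slow regression from hiding behind a healthy latency graph.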
Your dashboard should answer: Is the agent healthy right now? Top-level cards: success rate (%), p99 latency (ms), error count, cost per task ($). Second level: breakdown by task type, agent version, error type. Third level: drill-down into individual sessions for debugging.
Never ship a dashboard that requires clicking through 5 screens to find the answer. Put the answer on the front screen.
SLO tracking drill-down health status
Token costs are the most dangerous metric for agents: small bugs can lead to $1000s in overage within hours. Monitor: tokens per step (should be stable), total tokens per session (alert if it exceeds baseline by more than N%), and cost per session ($). Set hard limits: if cost per session > $X, fail fast and alert on-call.
A runaway agent that costs $100/session can rack up $50k in a day if it hits production undetected.
cost limits token budgets runaway detection
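The hard-limit rule can be enforced in code rather than policy. A sketch — the class name, exception, and the $5 default are placeholders for your own budget:

```python
class SessionBudgetExceeded(Exception):
    """Raised when a session's spend crosses its hard limit."""


class CostGuard:
    """Hard per-session cost ceiling. Fail fast instead of
    letting a runaway agent keep burning tokens."""
    def __init__(self, max_session_cost_usd=5.00):
        self.max = max_session_cost_usd
        self.spent = 0.0

    def charge(self, cost_usd):
        self.spent += cost_usd
        if self.spent > self.max:
            # stop the session and page on-call here
            raise SessionBudgetExceeded(
                f"session cost ${self.spent:.2f} exceeds limit ${self.max:.2f}")
```

Call `charge()` after every model or tool invocation; the exception is the fail-fast path, and the handler is where the on-call alert fires.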
Alert fatigue kills observability. Only alert on: (1) Hard SLO breaches — p99 latency > 5s, error rate > 2%, (2) Cost anomalies — cost per session 10x baseline, (3) Availability — no successful requests in 5 minutes, (4) Behavioral change — error type distribution shifted. Do not alert on: "request took 1.5s" (noise).
SLO alerts cost alerts actionable
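The four actionable conditions translate directly into an alert predicate. A sketch using the illustrative thresholds from the text — tune them to your own SLOs:

```python
def should_alert(p99_ms, error_rate, session_cost, baseline_cost,
                 successes_last_5m):
    """Return alert reasons only for actionable conditions.
    A single slow request never produces an alert."""
    reasons = []
    if p99_ms > 5000:                                   # hard SLO breach
        reasons.append("SLO breach: p99 latency > 5s")
    if error_rate > 0.02:                               # hard SLO breach
        reasons.append("SLO breach: error rate > 2%")
    if baseline_cost and session_cost > 10 * baseline_cost:
        reasons.append("cost anomaly: 10x baseline")    # cost anomaly
    if successes_last_5m == 0:                          # availability
        reasons.append("availability: no successes in 5 minutes")
    return reasons
```

Note what is absent: no per-request latency alert, because "request took 1.5s" is exactly the noise that causes alert fatigue.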
A single request to your system may touch 5+ agents, each calling tools, waiting for responses, making decisions. A trace is the full ancestry tree: root request → orchestrator agent → researcher agent → writer agent → reviewer agent → result. Each node is a span (unit of work). The trace context (trace ID) ties all spans together.
trace ID span hierarchy context propagation
A good tracing system lets you: (1) Reproduce the exact execution (all inputs/outputs), (2) See exactly where latency was spent (which agent was slow?), (3) Identify which tool call failed, (4) Trace costs to specific agents/tools. Without this, you're debugging blind.
Trace-based debugging is so powerful that it's worth the infrastructure investment even for small systems.
replay latency attribution cost attribution
Spans form a tree. Root span wraps the entire request. Child spans are agent steps. Grandchild spans are tool calls. This hierarchy lets you ask: "Why was my orchestrator slow?" Answer: look at its child spans. "Why was tool X slow?" Look at its duration. "Which tool called the payment API?" Trace the hierarchy.
parent-child hierarchy causality
Some failures are obvious (error rate spikes). Others are subtle (agent using 2x tokens, taking 5x steps, calling unfamiliar tools). Set baselines: your agent's normal token usage is 200/request. When it hits 400+, that's a signal. Normal step count is 4. When it hits 10, something changed. These anomalies often precede outages.
baseline drift statistical detection early warning
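Baseline comparison is a one-function check. A sketch — the baselines and multipliers mirror the numbers in the text and should come from your own historical data:

```python
def anomaly_signals(tokens_used, steps_taken,
                    baseline_tokens=200, baseline_steps=4,
                    token_factor=2.0, step_factor=2.5):
    """Flag sessions that drift far from normal before they
    become outages. No errors required to trigger a signal."""
    signals = []
    if tokens_used >= token_factor * baseline_tokens:
        signals.append(f"token usage {tokens_used} vs baseline {baseline_tokens}")
    if steps_taken >= step_factor * baseline_steps:
        signals.append(f"step count {steps_taken} vs baseline {baseline_steps}")
    return signals
```

A session at 400 tokens and 10 steps trips both signals even though nothing has errored yet — exactly the subtle failure mode described above.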
A circuit breaker is a pattern: if error rate exceeds X% for Y seconds, automatically stop accepting requests and return a fallback response (cached result, simpler agent, human escalation). Don't wait for a human to notice and manually kill the agent. Let the system self-heal.
Three states: Closed (healthy, requests flow through), Open (too many errors, reject requests), Half-Open (recovering, allow probe requests to test health).
circuit breaker auto-failover self-healing
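The three states map to a small state machine. A sketch with illustrative thresholds (error-rate window, cooldown) — production breakers would also track timeouts and use a sliding time window:

```python
import time

class CircuitBreaker:
    """Closed -> Open when the recent error rate breaches the limit;
    Open -> Half-Open after a cooldown; Half-Open -> Closed on a
    successful probe, back to Open on a failed one."""
    def __init__(self, max_error_rate=0.5, window=10, cooldown_s=30.0):
        self.max_error_rate = max_error_rate
        self.window = window          # number of recent calls considered
        self.cooldown_s = cooldown_s
        self.results = []             # True = success, False = error
        self.state = "closed"
        self.opened_at = None

    def allow(self):
        if self.state == "open":
            if time.time() - self.opened_at >= self.cooldown_s:
                self.state = "half-open"   # let one probe request through
                return True
            return False                   # reject: serve the fallback instead
        return True

    def record(self, ok):
        self.results = (self.results + [ok])[-self.window:]
        if self.state == "half-open":
            if ok:
                self.state, self.results = "closed", []
            else:
                self.state, self.opened_at = "open", time.time()
            return
        errors = self.results.count(False)
        if len(self.results) >= self.window and errors / len(self.results) > self.max_error_rate:
            self.state, self.opened_at = "open", time.time()
```

Callers check `allow()` before invoking the agent and serve the fallback (cached result, simpler agent, human escalation) when it returns False — no human in the loop to notice and kill the agent.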
Observability is only useful if someone acts on the signals. Define: who gets paged? When? What's the runbook? If your only on-call runbook is "restart the agent," you have a systemic problem. The runbook should be "check the trace, identify the issue, consider: upgrade model? Add guardrail? Rollback? Escalate?"
on-call rotation runbook game day
A prototype logs to stdout. Production logs to a centralized system (CloudWatch, Datadog, New Relic). Metrics go to a time-series DB (Prometheus, InfluxDB). Traces go to a specialized backend (Jaeger, Honeycomb). Cost: cheap. Value: priceless when a 3am incident hits.
The time to set up observability is before you need it.
centralized logging time-series metrics distributed tracing
Observability isn't just for debugging — it's required for compliance. GDPR requires you to explain what happened to a user's data. SOC2 requires audit trails. HIPAA requires immutable logs. These requirements force you to have observability anyway. Make it work for both debugging and compliance.
audit trail GDPR compliance
① Log all side effects — agent actions that change state must be logged.
② Trace context everywhere — propagate trace ID through all requests.
③ Structured, immutable logs — JSON, write-once, searchable.
④ Track all four metric types — latency, throughput, cost, quality.
⑤ Set SLOs before launch — know your thresholds in advance.
⑥ Circuit breakers for safety — auto-failover on degradation.
⑦ Alert on anomaly, not threshold — detect drift, not just absolutes.
⑧ Use traces for debugging — never guess, always replay.
⑨ On-call training required — observability only works if people use it.
⑩ Ship with observability — don't retrofit after production breaks.