agentlens | ← All Guides | Why Measure? 15 concepts · 5 phases Guide 04 →
Guide 03 of 9  ·  How do I know it's good?

Agent Evals
& Quality

From shipping on vibes to measuring, testing, and continuously improving AI agents — with animated diagrams for every concept.

✓ No evals experience needed 15 concepts 5 quality phases
Scroll to explore
Part 1
The Measurement Problem
Phase 1 of 5 · The Measurement Problem

Why Measure?

Every engineer who ships an agent eventually ships a bad one. Not because the model is broken — but because they couldn't tell the difference between a good output and a bad one until a user did.

Evals are how you define "good" before you ship, catch regressions before users do, and turn prompt engineering from guesswork into engineering.

The core problem: Agents are non-deterministic. The same input produces different outputs on different runs. Without evals you can't tell if a change helped, hurt, or did nothing. You're flying blind.

ship by vibes ship by evals

The Definition

What is an Eval?

An eval is a repeatable test that scores agent behavior. It has three parts: an input (the prompt or scenario), an expected (what good looks like), and a scorer (the function that compares actual to expected).

You run hundreds of evals at once. The aggregate score tells you whether the agent improved, regressed, or stayed flat — across a representative slice of real inputs.

Evals ≠ unit tests. Unit tests check deterministic code. Evals check probabilistic behavior — the score is a distribution, not a pass/fail. An agent with 4.1/5.0 is better than one with 3.8/5.0, even if both "pass."

input expected scorer score

Shift Your Mental Model

The Eval Mindset

Eval-driven development: write evals before you write the prompt. Define what "good" looks like first. Then try to get there. This forces you to be specific about your goals before you can drift.

Every prompt change is a hypothesis: "I think this change will improve accuracy on citation tasks." Evals prove or disprove it. No hypothesis, no change.

The hardest shift: Accepting that your intuition is unreliable. The model often surprises you — a change that "feels" better sometimes scores worse. Evals are the only source of truth.

hypothesis eval measure decide

Part 2
The Eval Stack
Phase 2 of 5 · The Eval Stack

Four Eval Types

Like software testing, evals form a pyramid. Unit evals test a single tool or function in isolation — fast and cheap. Integration evals test agent sub-flows. End-to-end evals score a complete interaction. Human evals catch what automation misses.

Run more at the bottom, fewer at the top. The pyramid shape is intentional: cheap tests give you fast feedback; expensive tests give you ground truth.

Don't skip the base. Teams that only do human evals get ground truth — slowly, expensively, and too late to catch regressions. Automate the base of the pyramid first.

unit integration e2e human

The Foundation

Golden Datasets

A golden dataset is a curated set of (input, expected output) pairs that represent the full range of your agent's use cases. Easy cases, hard cases, edge cases, adversarial inputs. It's the reference your scorer compares against.

The dataset is a living artifact. Add new cases every time a user reports an unexpected failure. After 50 runs, you have a regression suite that catches real production issues before they ship.

Start with 20 examples. Not 2000. A small, well-curated golden set catches most regressions. Add real failures as you discover them. Quality beats quantity every time.

easy cases edge cases adversarial

How Scoring Works

Scorers & Judges

Three scorer types exist. Rule-based scorers are exact — regex, JSON schema validation, substring match. Use them when output format is deterministic. Model-based (LLM-as-Judge) scorers evaluate semantics — quality, helpfulness, tone. Human scorers are the ground truth for ambiguous cases.

Stack them: rule-based catches format failures cheaply; LLM judges evaluate quality at scale; human review validates your judge calibration.

Calibrate your judge. An LLM judge that consistently scores 4.8/5.0 isn't useful — you can't see improvement. Run human labels on 50 examples and check the judge agrees. If correlation is below 0.8, fix the rubric first.

rule-based LLM judge human

Part 3
Writing Evals
Phase 3 of 5 · Writing Evals

Unit Evals

A unit eval tests a single skill or tool call. Give it a specific input, run it in isolation, compare the actual output to an expected schema or value. Fast, deterministic, and cheap to run in CI.

Unit evals catch: wrong tool selected, malformed arguments, broken output schema, unexpected error paths. They don't catch reasoning failures — that's what integration evals are for.

Write one unit eval per edge case, not per feature. The goal is coverage of the input space, not coverage of the code. Ask: "what inputs would break this tool?" Then write evals for those.

input tool call actual output schema check

Holistic Testing

End-to-End Evals

An end-to-end eval runs a complete multi-turn interaction — user message, agent response, tool calls, follow-ups — and scores the final result holistically. It's the closest thing to a real user experience.

E2E evals catch: reasoning failures, wrong tool sequences, coherence breakdowns, context loss between turns. They're expensive to run but irreplaceable for catching production issues before they reach users.

Trace every e2e eval. When an e2e eval fails, you need the full trace to understand why. Log every LLM call, every tool invocation, every intermediate result. Debugging without traces is guessing.

multi-turn tool sequence holistic score

Semantic Scoring

LLM-as-Judge

An LLM judge receives the agent's output plus a rubric and returns a structured score. Rubric dimensions might include: factual accuracy, helpfulness, conciseness, safety, and tone. Each dimension gets a score and a rationale.

The judge is itself a prompt — and it can be improved. A better rubric produces more consistent, more accurate scores. Treat your judge prompt with the same rigor as your agent prompt.

Use a stronger model as judge. If your agent runs on Claude Haiku, judge with Claude Sonnet. The judge needs to be capable of catching the agent's errors — it can't if it makes the same errors. The cost is worth it.

rubric score per dimension rationale

Part 4
Running & Interpreting
Phase 4 of 5 · Running & Interpreting

Eval Infrastructure

Evals that run manually are evals that don't run. Wire evals into CI/CD. Every prompt change triggers the eval suite. A score drop below your threshold blocks deployment. Green evals merge automatically.

The eval gate is your quality contract. It doesn't need to be perfect — it needs to run consistently. An 80% reliable gate is infinitely better than no gate.

Three-layer pipeline: Fast unit evals run on every PR (seconds). Integration evals run on merge to main (minutes). Full e2e evals run nightly (hours). You get fast feedback on small changes and deep validation before release.

PR trigger eval runner gate block/deploy

Interpretation

Reading Results

An overall score is a headline — drill into the dimensions. Which categories regressed? Which improved? A prompt change that lifts accuracy by 0.3 but drops safety by 0.4 is not an improvement.

Track score history like a dashboard. Every eval run gets a data point. After 20 runs you'll see the trend: are you improving, plateauing, or oscillating? Oscillation means your prompt changes aren't grounded in the data.

Watch the tail. Average scores can hide catastrophic failures. A 4.1/5.0 average with a 5% failure rate at 0.0/5.0 is a serious production risk. Always check the score distribution, not just the mean.

per-category trend tail failures

Catching Regressions

Regression Detection

A regression is when a previously passing eval now fails. Regressions are the most important signal in agent development. They mean something that worked is broken — and you know exactly when it broke (the last prompt change).

Set a regression threshold: if any category drops by more than 0.2 points versus the baseline, block the change. Tight thresholds catch small regressions early, before they compound into large failures in production.

Regressions in style, not just accuracy. Agents regress on tone, format, verbosity, and language consistency just as often as on factual accuracy. Include stylistic dimensions in your rubric or style regressions will slip through.

regression threshold baseline

Part 5
The Quality Loop
Phase 5 of 5 · The Quality Loop

Eval-Driven Development

The loop is: Observe → Hypothesize → Eval → Change → Measure → Repeat. You observe a failure mode (from traces, user feedback, or a failing eval). You hypothesize a cause. You write a new eval that captures it. You change the prompt. You measure whether the score improved.

This is tight. The best teams run this loop in a day. Most teams run it in a week. The difference is eval infrastructure — teams with fast, reliable evals iterate faster.

The loop is compounding. Each pass through the loop produces a better agent AND a better eval suite. After 20 loops, you have both a high-quality agent and a test suite that protects it. The first loop is the hardest. The tenth is fast.
Principles

8 Eval Rules

Eight rules experienced teams follow. Violate them and you'll rediscover why they exist.

① Define before you build — Write evals before writing the prompt.
② Start with 20 examples — Not 2000. Quality beats quantity.
③ Calibrate your judge — Validate LLM judges against human labels.
④ Gate on score, not on pass/fail — Scores are distributions.
⑤ Never delete failing evals — They represent real failures you fixed.
⑥ Watch the tail — Average scores hide catastrophic failures.
⑦ Add one eval per bug — Every production failure gets a regression test.
⑧ Evals are the product — A high-quality eval suite is your competitive moat.

Continue Learning

What's Next?

You can measure agents. Now combine this with the engineering and orchestration foundations to ship reliable, continuously improving systems at scale.

The three guides form a complete picture: Orchestration (how agents coordinate), Engineering (how to build them), Evals (how to know they're good).

Orchestration Patterns — 10 animated patterns for multi-agent systems. When a single agent can't do the job.
Agent Engineering — From writing your first system prompt to shipping production agents.