From shipping on vibes to measuring, testing, and continuously improving AI agents — with animated diagrams for every concept.
Every engineer who ships an agent eventually ships a bad one. Not because the model is broken — but because they couldn't tell the difference between a good output and a bad one until a user did.
Evals are how you define "good" before you ship, catch regressions before users do, and turn prompt engineering from guesswork into engineering.
ship by vibes ship by evals
An eval is a repeatable test that scores agent behavior. It has three parts: an input (the prompt or scenario), an expected (what good looks like), and a scorer (the function that compares actual to expected).
You run hundreds of evals at once. The aggregate score tells you whether the agent improved, regressed, or stayed flat — across a representative slice of real inputs.
input expected scorer score
Eval-driven development: write evals before you write the prompt. Define what "good" looks like first. Then try to get there. This forces you to be specific about your goals before you can drift.
Every prompt change is a hypothesis: "I think this change will improve accuracy on citation tasks." Evals prove or disprove it. No hypothesis, no change.
hypothesis eval measure decide
Like software testing, evals form a pyramid. Unit evals test a single tool or function in isolation — fast and cheap. Integration evals test agent sub-flows. End-to-end evals score a complete interaction. Human evals catch what automation misses.
Run more at the bottom, fewer at the top. The pyramid shape is intentional: cheap tests give you fast feedback; expensive tests give you ground truth.
unit integration e2e human
A golden dataset is a curated set of (input, expected output) pairs that represent the full range of your agent's use cases. Easy cases, hard cases, edge cases, adversarial inputs. It's the reference your scorer compares against.
The dataset is a living artifact. Add new cases every time a user reports an unexpected failure. After 50 runs, you have a regression suite that catches real production issues before they ship.
easy cases edge cases adversarial
Three scorer types exist. Rule-based scorers are exact — regex, JSON schema validation, substring match. Use them when output format is deterministic. Model-based (LLM-as-Judge) scorers evaluate semantics — quality, helpfulness, tone. Human scorers are the ground truth for ambiguous cases.
Stack them: rule-based catches format failures cheaply; LLM judges evaluate quality at scale; human review validates your judge calibration.
rule-based LLM judge human
A unit eval tests a single skill or tool call. Give it a specific input, run it in isolation, compare the actual output to an expected schema or value. Fast, deterministic, and cheap to run in CI.
Unit evals catch: wrong tool selected, malformed arguments, broken output schema, unexpected error paths. They don't catch reasoning failures — that's what integration evals are for.
input tool call actual output schema check
An end-to-end eval runs a complete multi-turn interaction — user message, agent response, tool calls, follow-ups — and scores the final result holistically. It's the closest thing to a real user experience.
E2E evals catch: reasoning failures, wrong tool sequences, coherence breakdowns, context loss between turns. They're expensive to run but irreplaceable for catching production issues before they reach users.
multi-turn tool sequence holistic score
An LLM judge receives the agent's output plus a rubric and returns a structured score. Rubric dimensions might include: factual accuracy, helpfulness, conciseness, safety, and tone. Each dimension gets a score and a rationale.
The judge is itself a prompt — and it can be improved. A better rubric produces more consistent, more accurate scores. Treat your judge prompt with the same rigor as your agent prompt.
rubric score per dimension rationale
Evals that run manually are evals that don't run. Wire evals into CI/CD. Every prompt change triggers the eval suite. A score drop below your threshold blocks deployment. Green evals merge automatically.
The eval gate is your quality contract. It doesn't need to be perfect — it needs to run consistently. An 80% reliable gate is infinitely better than no gate.
PR trigger eval runner gate block/deploy
An overall score is a headline — drill into the dimensions. Which categories regressed? Which improved? A prompt change that lifts accuracy by 0.3 but drops safety by 0.4 is not an improvement.
Track score history like a dashboard. Every eval run gets a data point. After 20 runs you'll see the trend: are you improving, plateauing, or oscillating? Oscillation means your prompt changes aren't grounded in the data.
per-category trend tail failures
A regression is when a previously passing eval now fails. Regressions are the most important signal in agent development. They mean something that worked is broken — and you know exactly when it broke (the last prompt change).
Set a regression threshold: if any category drops by more than 0.2 points versus the baseline, block the change. Tight thresholds catch small regressions early, before they compound into large failures in production.
regression threshold baseline
The loop is: Observe → Hypothesize → Eval → Change → Measure → Repeat. You observe a failure mode (from traces, user feedback, or a failing eval). You hypothesize a cause. You write a new eval that captures it. You change the prompt. You measure whether the score improved.
This is tight. The best teams run this loop in a day. Most teams run it in a week. The difference is eval infrastructure — teams with fast, reliable evals iterate faster.
Eight rules experienced teams follow. Violate them and you'll rediscover why they exist.
① Define before you build — Write evals before writing the prompt.
② Start with 20 examples — Not 2000. Quality beats quantity.
③ Calibrate your judge — Validate LLM judges against human labels.
④ Gate on score, not on pass/fail — Scores are distributions.
⑤ Never delete failing evals — They represent real failures you fixed.
⑥ Watch the tail — Average scores hide catastrophic failures.
⑦ Add one eval per bug — Every production failure gets a regression test.
⑧ Evals are the product — A high-quality eval suite is your competitive moat.
You can measure agents. Now combine this with the engineering and orchestration foundations to ship reliable, continuously improving systems at scale.
The three guides form a complete picture: Orchestration (how agents coordinate), Engineering (how to build them), Evals (how to know they're good).