From writing your first system prompt to shipping production agents — with animated diagrams for every concept.
Two disciplines are merging. Data science teams build models; product teams build software. Agent engineering is what you need when they must work together reliably at scale.
The overlap — where you define what the model does, what it can call, and what runs around every call — is the new craft of agent engineering.
Product Eng Data Science Agent Engineering
You write instructions (what the agent does), skills (what it can call), and hooks (what runs around every call) — not procedural code. The LLM is your runtime.
Agent engineering is the practice of building reliable, observable, and continuously improvable systems where a language model is the decision engine.
instructions skills hooks evals
Treat the LLM as your runtime. Debugging uses traces, not stack traces. Quality is measured with evals, not just unit tests.
Everything you know about software engineering maps to agent engineering — just with different tools for every discipline.
code → prompt unit tests → evals logs → traces config → instructions
Every production agent has five layers. Instructions define who it is. Skills define what it can do. Hooks run around every action. Memory persists context. Guardrails keep it safe.
You build all five — not just the instructions. The quality of the full stack determines whether your agent is a demo or a product.
instructions skills hooks memory guardrails
The system prompt is your agent's constitution. Persona sets the tone. Rules define hard constraints. Few-shot examples show the model what "good" looks like. Format specifies output structure.
A good system prompt is written for the worst-case input, not the average one. Every rule should survive an adversarial user.
Skills are functions the agent can call. The name and description are what the model reads to decide which skill to use — so they matter more than the implementation. Schema validates the arguments.
Name skills as verbs: search_web, not web. Descriptions should say when to use the skill, not just what it does. The model reads them like a menu.
name description schema validation
Hooks are middleware. Pre-tool hooks run before every skill call — validate, log, rate-limit, check policy. Post-tool hooks run after — sanitize output, check PII, record cost. On-error hooks handle failures.
A pre-hook can reject a tool call entirely, returning a denial reason that gets added back to the model's context. The model then reasons about the rejection and tries another approach.
pre-hook post-hook reject on-error
Start with in-context (conversation history). Hit the limit → add a vector store for docs. Users keep re-explaining preferences → add episodic memory. Agent repeats mistakes → fix it in procedural (update the system prompt).
Each ring is more expensive than the one inside it. Start at the center and move outward only when you hit a real limit.
in-context vector episodic procedural
Test at every layer. Unit-test each tool function in isolation. Integration-test the agent with mocked tools. Evals score real outputs against a golden dataset — catch prompt regressions before users do.
The base of the pyramid is the widest: many fast unit tests. Evals at the top are expensive but essential — they're the only way to know if a prompt change improved things or not.
unit tests integration evals
The loop never stops. Build → Test → Ship → Observe → Refine → Build again.
Real example: Week 1: wrote instructions. Week 2: eval 3.1/5. Week 3: rewrote tone rules → 3.9/5. Week 4: shipped to 10%. Week 5: new failure found in traces → back to Build.
Version your prompts like code. Roll out gradually — 10% canary first. An automated eval gate blocks promotion if accuracy drops. Rollback is instant because you kept the previous version.
Prompt changes are code changes. They deserve the same rigor: review, testing, staged rollout, and rollback capability.
dev staging canary 10% prod 100%
Traces show you exactly what happened. Metrics aggregate patterns across thousands of calls. Sparklines show trends over time. User feedback closes the loop.
If you can't see it, you can't fix it. Every agent in production needs traces on day one — not as an afterthought when something breaks.
traces metrics trends feedback
Prompt iteration is engineering, not guessing. Form a hypothesis ("the agent hallucinates citations because the instruction doesn't require grounding"). Test it in evals. Ship only if the score improves.
The tightest teams run this cycle in days, not weeks. Each loop adds 0.2–0.5 points to an eval score that users actually feel.
Guardrails are not optional in production. Input validation catches prompt injection and policy violations before the model sees them. Output validation checks for PII, hallucinations, and unsafe content before the user sees the response.
A guardrail that never fires is still doing its job. Its presence deters attempts and provides an audit trail when something slips through.
input guardrail agent LLM output guardrail
Eight rules that experienced agent engineers follow. They seem obvious in hindsight. Burn them in before you start.
① Start simple — Build the dumbest thing that could work first.
② Measure first — Run evals before making any change.
③ One skill, one job — Never let a tool do two things.
④ Write for worst-case — Your instructions survive adversarial users.
⑤ Gate irreversible actions — Confirm before deleting, sending, or paying.
⑥ Budget tokens like money — Every token has a cost and a limit.
⑦ Version everything — Prompts are code; treat them that way.
⑧ Evals are the product — If you can't measure quality, you can't ship safely.
You can build agents. Two more guides complete the picture — one on how agents coordinate at scale, one on how to measure and continuously improve them.
orchestration evals multi-agent