Agent Engineering — agentlens

Phase 1 of 5 · Why & What

Why Now?

Two disciplines are merging. Data science teams build models; product teams build software. Agent engineering is what you need when they must work together reliably at scale.

The overlap — where you define what the model does, what it can call, and what runs around every call — is the new craft of agent engineering.

Why now? LLMs crossed a capability threshold. They can follow complex instructions, call tools reliably, and reason across many steps. The bottleneck is no longer the model — it's the engineer who knows how to direct it.

Product Eng Data Science Agent Engineering

The Definition

What is Agent Engineering?

You write instructions (what the agent does), skills (what it can call), and hooks (what runs around every call) — not procedural code. The LLM is your runtime.

Agent engineering is the practice of building reliable, observable, and continuously improvable systems where a language model is the decision engine.

Not a data scientist? You don't need to train models. Not a backend engineer? You don't write control flow. Agent engineers define behavior in language and measure it with evals.

instructions skills hooks evals

Shift Your Mental Model

The Engineering Mindset

Treat the LLM as your runtime. Debugging uses traces, not stack traces. Quality is measured with evals, not just unit tests.

Everything you know about software engineering maps to agent engineering — just with different tools for every discipline.

The hardest shift: You can't read the LLM's source code. You can only observe its behavior. Write evals before you write instructions — define "good" before you try to get there.

code → prompt unit tests → evals logs → traces config → instructions

Phase 2 of 5 · The Stack

The Agent Engineering Stack

Every production agent has five layers. Instructions define who it is. Skills define what it can do. Hooks run around every action. Memory persists context. Guardrails keep it safe.

You build all five — not just the instructions. The quality of the full stack determines whether your agent is a demo or a product.

Don't skip layers. Most early agents only have instructions. They fail in production because they lack hooks (no logging), memory (no context), or guardrails (no safety). Build the full stack from the start.

instructions skills hooks memory guardrails

Layer 1: Instructions

Writing Instructions

The system prompt is your agent's constitution. Persona sets the tone. Rules define hard constraints. Few-shot examples show the model what "good" looks like. Format specifies output structure.

A good system prompt is written for the worst-case input, not the average one. Every rule should survive an adversarial user.

Real pattern: Start with persona → add rules → add examples → specify format. Never start with format — it anchors you before you've defined the goal.

Layer 2: Skills

Writing Skills (Tools)

Skills are functions the agent can call. The name and description are what the model reads to decide which skill to use — so they matter more than the implementation. Schema validates the arguments.

Name skills as verbs: search_web, not web. Descriptions should say when to use the skill, not just what it does. The model reads them like a menu.

One skill, one job. A skill that does two things is a skill the model will misuse. If a tool searches AND summarizes, split it. Composition happens in instructions, not in tool implementations.

name description schema validation

Phase 3 of 5 · Build & Test

Writing Hooks

Hooks are middleware. Pre-tool hooks run before every skill call — validate, log, rate-limit, check policy. Post-tool hooks run after — sanitize output, check PII, record cost. On-error hooks handle failures.

A pre-hook can reject a tool call entirely, returning a denial reason that gets added back to the model's context. The model then reasons about the rejection and tries another approach.

Hooks are not optional in production. Without a pre-hook you can't rate-limit. Without a post-hook you can't catch PII leaks. Hooks are where operational requirements live — separate from your business logic.

pre-hook post-hook reject on-error

Layer 4: Memory

Memory Types

Start with in-context (conversation history). Hit the limit → add a vector store for docs. Users keep re-explaining preferences → add episodic memory. Agent repeats mistakes → fix it in procedural (update the system prompt).

Each ring is more expensive than the one inside it. Start at the center and move outward only when you hit a real limit.

The most common mistake: jumping to vector stores before filling the context window. In-context memory is fast, consistent, and free. Only add complexity when the simple version fails.

in-context vector episodic procedural

Testing

Testing Agents

Test at every layer. Unit-test each tool function in isolation. Integration-test the agent with mocked tools. Evals score real outputs against a golden dataset — catch prompt regressions before users do.

The base of the pyramid is the widest: many fast unit tests. Evals at the top are expensive but essential — they're the only way to know if a prompt change improved things or not.

Eval-driven development: Write three evals before changing your system prompt. Run them after. If two improve and one regresses, you have a decision to make — not a guess.

unit tests integration evals

The Loop

The Lifecycle

The loop never stops. Build → Test → Ship → Observe → Refine → Build again.

Real example: Week 1: wrote instructions. Week 2: eval 3.1/5. Week 3: rewrote tone rules → 3.9/5. Week 4: shipped to 10%. Week 5: new failure found in traces → back to Build.

The loop is the product. Your agent isn't done when it ships — it's done when you stop improving it. Teams that invest in the observe and refine steps ship better agents faster than teams that rebuild from scratch.

Phase 4 of 5 · Ship & Observe

Shipping Agents

Version your prompts like code. Roll out gradually — 10% canary first. An automated eval gate blocks promotion if accuracy drops. Rollback is instant because you kept the previous version.

Prompt changes are code changes. They deserve the same rigor: review, testing, staged rollout, and rollback capability.

Never ship without a rollback plan. An agent that hallucinates at 100% traffic is a production incident. A canary at 10% gives you 90% protection while you validate real behavior with real users.

dev staging canary 10% prod 100%

Observability

Observing Agents

Traces show you exactly what happened. Metrics aggregate patterns across thousands of calls. Sparklines show trends over time. User feedback closes the loop.

If you can't see it, you can't fix it. Every agent in production needs traces on day one — not as an afterthought when something breaks.

The trace is your debugger. When an agent does something unexpected, you open the trace and follow the reasoning: which LLM call produced which output, which tool was called with what arguments, where the reasoning went wrong.

traces metrics trends feedback

Continuous Improvement

Refining Agents

Prompt iteration is engineering, not guessing. Form a hypothesis ("the agent hallucinates citations because the instruction doesn't require grounding"). Test it in evals. Ship only if the score improves.

The tightest teams run this cycle in days, not weeks. Each loop adds 0.2–0.5 points to an eval score that users actually feel.

Without evals, refinement is superstition. You think you made it better. With evals, you know. Track score history like a dashboard — every prompt change gets a data point.

Phase 5 of 5 · Govern & Principles

Guardrails

Guardrails are not optional in production. Input validation catches prompt injection and policy violations before the model sees them. Output validation checks for PII, hallucinations, and unsafe content before the user sees the response.

A guardrail that never fires is still doing its job. Its presence deters attempts and provides an audit trail when something slips through.

Layered defense: Input guardrails protect the model. Output guardrails protect the user. Error handling protects the system. You need all three — they catch different failure modes.

input guardrail agent LLM output guardrail

Principles

8 Thumb Rules

Eight rules that experienced agent engineers follow. They seem obvious in hindsight. Burn them in before you start.

① Start simple — Build the dumbest thing that could work first.
② Measure first — Run evals before making any change.
③ One skill, one job — Never let a tool do two things.
④ Write for worst-case — Your instructions survive adversarial users.
⑤ Gate irreversible actions — Confirm before deleting, sending, or paying.
⑥ Budget tokens like money — Every token has a cost and a limit.
⑦ Version everything — Prompts are code; treat them that way.
⑧ Evals are the product — If you can't measure quality, you can't ship safely.

Continue Learning

What's Next?

You can build agents. Two more guides complete the picture — one on how agents coordinate at scale, one on how to measure and continuously improve them.

→ Orchestration Patterns — 10 patterns for multi-agent systems, from Orchestrator-Worker to Human-in-the-Loop, each with a live animated diagram.

→ Evals & Quality — From shipping on vibes to measurable, continuously improving agents. How to write evals, run them in CI, and catch regressions before users do.

orchestration evals multi-agent

A Practical Guide toAgent Engineering

Why Now?

What is Agent Engineering?

The Engineering Mindset

The Agent Engineering Stack

Writing Instructions

Writing Skills (Tools)

Writing Hooks

Memory Types

Testing Agents

The Lifecycle

Shipping Agents

Observing Agents

Refining Agents

Guardrails

8 Thumb Rules

What's Next?

A Practical Guide to
Agent Engineering