Building agents users can rely on — threat models, guardrails, human oversight, and governance for production AI systems.
A bad LLM response is embarrassing. A bad agent action is damage — an email sent, a file deleted, a payment made, a record modified. Agents have agency: they take actions in the world with real, often irreversible consequences.
Traditional software safety is about preventing crashes. Agent safety is about preventing unintended real-world effects. The threat surface is wider, the blast radius is larger, and the failure modes are more subtle.
bad action irreversible blast radius governance
Five threat categories that agents face: Prompt injection (hostile instructions embedded in user input, tool outputs, or retrieved documents — the #1 attack vector), Jailbreaking (crafted inputs that bypass safety instructions), Indirect injection (attack hidden in content the agent reads), Data poisoning (corrupting the agent's memory or knowledge base), Exfiltration (tricking the agent into revealing secrets).
prompt injection jailbreak indirect injection data poisoning
Agents fail in three distinct ways that require distinct defenses. Capability failures: the agent can't do what it's supposed to — wrong answer, wrong tool, incomplete output. Evals (Guide 03) catch these. Alignment failures: the agent does something it shouldn't — follows a malicious instruction, ignores a constraint. These require guardrails and oversight. Availability failures: the agent fails to respond — infinite loops, context overflow, cost explosion.
capability alignment availability
Prompt injection is when hostile instructions, embedded in data the agent processes, override the operator's instructions. Direct: the user types "Ignore previous instructions and…" Indirect: a webpage the agent visits contains hidden text instructing it to exfiltrate data.
The defense has three layers: never trust tool outputs (treat retrieved content as data, not instructions), content isolation (use XML tags to clearly delimit operator vs user vs tool content), instruction hierarchy (explicitly state that system prompt takes precedence).
direct injection indirect injection content isolation instruction hierarchy
Guardrails that run before the model sees anything. Three layers: Content classifiers (ML models that detect policy violations, PII, injection patterns — before the input reaches the LLM), Schema validation (structured inputs validated against an expected schema — rejects malformed requests early), Rate limiting and anomaly detection (unusual patterns trigger alerts or blocks).
The key principle: the LLM should be the last line of defense, not the first. If you can catch a bad input without spending a token, you should.
classifier schema validation rate limiting anomaly detection
Agents should only have access to the tools and data they need for the current task — nothing more. This is the most important architectural safety decision you make. Concrete: a customer support agent doesn't need write access to the database; a research agent doesn't need to send emails; a coding agent doesn't need production credentials.
Least privilege doesn't prevent all attacks, but it drastically reduces the blast radius when an attack succeeds. Scope tools dynamically: for each task, grant only the specific permissions required, then revoke them.
minimal scope dynamic permissions blast radius task-scoped access
Guardrails that run after the model responds, before anything reaches the user or executes. Four categories: PII detection (names, emails, SSNs — catch before leakage), Hallucination checks (cross-reference claims against retrieved source documents), Policy compliance (content filters, regulatory language requirements), Format validation (if output should be JSON, validate it before delivery).
Output guardrails are not optional in production. They're your last line of defense between the model's output and the real world.
PII detection hallucination check policy filter format validation
Before any irreversible action — send an email, delete a record, make a payment, post publicly — pause and require explicit confirmation. The pattern: the agent prepares the action and presents a summary, a human or automated policy engine approves or rejects, only then does the action execute.
Three tiers: Auto-approve (low-risk, reversible — read file, search web), Soft gate (moderate risk — compose draft, stage changes — pause but don't require active approval), Hard gate (irreversible or high-stakes — require explicit human approval every time).
auto-approve soft gate hard gate approval flow
When in doubt, do less. Prefer reversible actions over irreversible ones. Request only the permissions needed right now. Err on the side of asking for confirmation rather than acting. Leave the smallest possible trace.
This principle is not just about safety — it's about trust. Users and organizations trust agents more when those agents are visibly conservative. An agent that says "I could do X but want to confirm" is more trustworthy than one that silently does X.
reversible first confirm before act minimal scope visible restraint
Five patterns ordered by oversight intensity: Async review (agent acts, human reviews audit log after), Approval queue (high-risk actions queue for approval before executing), Co-pilot (agent proposes, human decides every action), Interrupt-on-exception (autonomous but pauses when confidence is low), Scheduled checkpoints (runs autonomously, reports at intervals).
async review approval queue co-pilot interrupt checkpoints
Four signal categories to watch after deployment: Behavioral anomalies (agent using unfamiliar tools or unusual argument patterns — may indicate injection), Cost anomalies (sudden token or API cost spikes — often signals runaway loops or injection attacks), Latency anomalies (excessive response times may indicate context overflow), Output anomalies (eval score drops, policy violation rate increases).
Design halt conditions before launch — know exactly what signal triggers automatic shutdown.
behavioral cost latency output quality
Every agent action should be logged, attributable, and reviewable. The audit trail answers: what did the agent do, when, with what inputs, and what was the result? Required fields: timestamp, session ID, user ID, action type, full input, full output, latency, cost, tool used, arguments passed, response received.
Audit trails are for compliance (GDPR right to explanation), for security incident response (what did the compromised agent do?), and for trust (users can see exactly what the agent did on their behalf).
structured logging full trace retention policy GDPR
Safety is a team practice, not a solo checklist. Three practices that distinguish safety-mature teams: Red teaming (deliberately try to break your own agent — prompt injection, jailbreak attempts, adversarial inputs — before every major release), Staged rollout with monitoring (10% canary before 100% — catch unexpected behavior with limited blast radius), Incident response planning (know what you'll do before something goes wrong — who gets paged, what's the rollback, how do you communicate).
red team staged rollout incident response rollback plan
Safety guardrails prevent bad behavior. Alignment makes an agent want to behave well. The difference: a guardrail stops an agent from leaking PII. An aligned agent wouldn't try to leak PII even without the guardrail.
Alignment by design means: clear values in the system prompt (what the agent cares about, not just what it must not do), honest uncertainty (an agent that says "I don't know" is safer than one that confidently hallucinates), minimal footprint by disposition, and corrigibility (the agent actively supports human oversight rather than resisting it).
clear values honest uncertainty corrigibility minimal footprint
Eight rules that safety-mature teams follow. These compound — violating one weakens all the others.
① Assume breach — Design as if attackers will get in; minimize blast radius.
② Never trust tool outputs — Treat all retrieved content as potentially hostile data.
③ Gate irreversible actions — always — No exceptions, no urgency overrides.
④ Principle of least privilege — Smallest permission set for the current task.
⑤ Defense in depth — Input guardrails, output guardrails, rate limiting, monitoring — every layer.
⑥ Audit everything — If you can't replay it, you can't investigate it.
⑦ Red team before you ship — Find your own injections before attackers do.
⑧ Build for restraint — An agent that does less and asks more is safer than one that acts boldly.