Safety & Trust — agentlens

Phase 1 of 5 · The Safety Landscape

Why Agent Safety is Different

A bad LLM response is embarrassing. A bad agent action is damage — an email sent, a file deleted, a payment made, a record modified. Agents have agency: they take actions in the world with real, often irreversible consequences.

Traditional software safety is about preventing crashes. Agent safety is about preventing unintended real-world effects. The threat surface is wider, the blast radius is larger, and the failure modes are more subtle.

The fundamental shift: You're no longer just validating outputs. You're governing actions. An agent that makes a wrong API call can do more damage in 30 seconds than a chatbot can do in 30 days.

bad action irreversible blast radius governance

The Attack Surface

The Threat Model

Five threat categories that agents face: Prompt injection (hostile instructions embedded in user input, tool outputs, or retrieved documents — the #1 attack vector), Jailbreaking (crafted inputs that bypass safety instructions), Indirect injection (attack hidden in content the agent reads), Data poisoning (corrupting the agent's memory or knowledge base), Exfiltration (tricking the agent into revealing secrets).

Indirect injection is the most dangerous. The agent obeys the attacker, not the operator — and the operator may not even know there was an attack. A webpage the agent visits, a document it reads, can contain hidden instructions.

prompt injection jailbreak indirect injection data poisoning

Failure Taxonomy

Three Kinds of Agent Failure

Agents fail in three distinct ways that require distinct defenses. Capability failures: the agent can't do what it's supposed to — wrong answer, wrong tool, incomplete output. Evals (Guide 03) catch these. Alignment failures: the agent does something it shouldn't — follows a malicious instruction, ignores a constraint. These require guardrails and oversight. Availability failures: the agent fails to respond — infinite loops, context overflow, cost explosion.

Most safety engineering focuses only on capability. Alignment failures are harder to detect (the agent succeeded at the wrong goal) and availability failures are often invisible until they hit your bill. All three need separate defenses.

capability alignment availability

Phase 2 of 5 · Input Safety

Prompt Injection — The #1 Attack

Prompt injection is when hostile instructions, embedded in data the agent processes, override the operator's instructions. Direct: the user types "Ignore previous instructions and…" Indirect: a webpage the agent visits contains hidden text instructing it to exfiltrate data.

The defense has three layers: never trust tool outputs (treat retrieved content as data, not instructions), content isolation (use XML tags to clearly delimit operator vs user vs tool content), instruction hierarchy (explicitly state that system prompt takes precedence).

The LLM cannot reliably detect injection attacks in content it's processing. Defense must be structural — isolation, hierarchy, and least privilege — not just a prompt that says "ignore malicious instructions."

direct injection indirect injection content isolation instruction hierarchy

Defense in Depth

Input Guardrails

Guardrails that run before the model sees anything. Three layers: Content classifiers (ML models that detect policy violations, PII, injection patterns — before the input reaches the LLM), Schema validation (structured inputs validated against an expected schema — rejects malformed requests early), Rate limiting and anomaly detection (unusual patterns trigger alerts or blocks).

The key principle: the LLM should be the last line of defense, not the first. If you can catch a bad input without spending a token, you should.

Most teams skip input guardrails and put all safety logic in the system prompt. This is equivalent to building a bank vault with no locks on the front door. Cheap classifiers catching cheap attacks save expensive model calls for real work.

classifier schema validation rate limiting anomaly detection

Permission Architecture

Principle of Least Privilege

Agents should only have access to the tools and data they need for the current task — nothing more. This is the most important architectural safety decision you make. Concrete: a customer support agent doesn't need write access to the database; a research agent doesn't need to send emails; a coding agent doesn't need production credentials.

Least privilege doesn't prevent all attacks, but it drastically reduces the blast radius when an attack succeeds. Scope tools dynamically: for each task, grant only the specific permissions required, then revoke them.

Think in blast radius, not in trust. The question isn't "do I trust this agent?" It's "if this agent is compromised, what's the worst it can do?" Minimize that surface aggressively.

minimal scope dynamic permissions blast radius task-scoped access

Phase 3 of 5 · Output Safety

Output Guardrails

Guardrails that run after the model responds, before anything reaches the user or executes. Four categories: PII detection (names, emails, SSNs — catch before leakage), Hallucination checks (cross-reference claims against retrieved source documents), Policy compliance (content filters, regulatory language requirements), Format validation (if output should be JSON, validate it before delivery).

Output guardrails are not optional in production. They're your last line of defense between the model's output and the real world.

Stack them in order of cost: fast regex/schema checks first, then classifiers, then LLM-based checks. Cheap filters run on everything. Expensive filters run only on what passes the cheap ones.

PII detection hallucination check policy filter format validation

Reversibility Gates

Irreversibility Gating

Before any irreversible action — send an email, delete a record, make a payment, post publicly — pause and require explicit confirmation. The pattern: the agent prepares the action and presents a summary, a human or automated policy engine approves or rejects, only then does the action execute.

Three tiers: Auto-approve (low-risk, reversible — read file, search web), Soft gate (moderate risk — compose draft, stage changes — pause but don't require active approval), Hard gate (irreversible or high-stakes — require explicit human approval every time).

No exceptions to hard gates. The entire point is that hard gates apply regardless of context, urgency, or how confident the agent sounds. An agent that bypasses its own hard gates is unsafe by design.

auto-approve soft gate hard gate approval flow

Restraint by Design

The Minimal Footprint Principle

When in doubt, do less. Prefer reversible actions over irreversible ones. Request only the permissions needed right now. Err on the side of asking for confirmation rather than acting. Leave the smallest possible trace.

This principle is not just about safety — it's about trust. Users and organizations trust agents more when those agents are visibly conservative. An agent that says "I could do X but want to confirm" is more trustworthy than one that silently does X.

Design for restraint from the start. It's much harder to add minimal footprint behavior to an agent that was designed to act boldly. Build restraint in at the architecture level, not as a patch on top of aggressive defaults.

reversible first confirm before act minimal scope visible restraint

Phase 4 of 5 · Human Oversight

Human-in-the-Loop Patterns

Five patterns ordered by oversight intensity: Async review (agent acts, human reviews audit log after), Approval queue (high-risk actions queue for approval before executing), Co-pilot (agent proposes, human decides every action), Interrupt-on-exception (autonomous but pauses when confidence is low), Scheduled checkpoints (runs autonomously, reports at intervals).

Choose based on consequence severity, not on convenience. Interrupt-on-exception is usually the right default — it gives you most of the autonomy benefit while keeping a human in the loop when it matters most.

async review approval queue co-pilot interrupt checkpoints

Production Signals

Monitoring & Alerting in Production

Four signal categories to watch after deployment: Behavioral anomalies (agent using unfamiliar tools or unusual argument patterns — may indicate injection), Cost anomalies (sudden token or API cost spikes — often signals runaway loops or injection attacks), Latency anomalies (excessive response times may indicate context overflow), Output anomalies (eval score drops, policy violation rate increases).

Design halt conditions before launch — know exactly what signal triggers automatic shutdown.

Define your kill switch before you deploy. "If cost per session exceeds $X, halt automatically." "If policy violation rate exceeds Y%, alert on-call." These thresholds should be in code, not in a runbook nobody reads during an incident.

behavioral cost latency output quality

Accountability

Audit Trails

Every agent action should be logged, attributable, and reviewable. The audit trail answers: what did the agent do, when, with what inputs, and what was the result? Required fields: timestamp, session ID, user ID, action type, full input, full output, latency, cost, tool used, arguments passed, response received.

Audit trails are for compliance (GDPR right to explanation), for security incident response (what did the compromised agent do?), and for trust (users can see exactly what the agent did on their behalf).

If you can't replay it, you can't investigate it. Structured JSON logging with full inputs and outputs is non-negotiable for any agent that touches user data or makes external calls. Storage is cheap. Incident reconstruction without logs is not.

structured logging full trace retention policy GDPR

Phase 5 of 5 · Governance & Principles

Organizational Safety Practices

Safety is a team practice, not a solo checklist. Three practices that distinguish safety-mature teams: Red teaming (deliberately try to break your own agent — prompt injection, jailbreak attempts, adversarial inputs — before every major release), Staged rollout with monitoring (10% canary before 100% — catch unexpected behavior with limited blast radius), Incident response planning (know what you'll do before something goes wrong — who gets paged, what's the rollback, how do you communicate).

Red team before you ship, every time. Teams that only red team at launch miss the regressions introduced by subsequent changes. Every prompt change is a new attack surface that needs to be tested.

red team staged rollout incident response rollback plan

Values, Not Just Rules

Alignment by Design

Safety guardrails prevent bad behavior. Alignment makes an agent want to behave well. The difference: a guardrail stops an agent from leaking PII. An aligned agent wouldn't try to leak PII even without the guardrail.

Alignment by design means: clear values in the system prompt (what the agent cares about, not just what it must not do), honest uncertainty (an agent that says "I don't know" is safer than one that confidently hallucinates), minimal footprint by disposition, and corrigibility (the agent actively supports human oversight rather than resisting it).

Corrigibility is a safety feature. An agent that readily defers to human judgment, transparently explains its reasoning, and flags its own uncertainty is fundamentally safer than one that maximizes task completion at all costs.

clear values honest uncertainty corrigibility minimal footprint

Principles

8 Safety Rules

Eight rules that safety-mature teams follow. These compound — violating one weakens all the others.

① Assume breach — Design as if attackers will get in; minimize blast radius.
② Never trust tool outputs — Treat all retrieved content as potentially hostile data.
③ Gate irreversible actions — always — No exceptions, no urgency overrides.
④ Principle of least privilege — Smallest permission set for the current task.
⑤ Defense in depth — Input guardrails, output guardrails, rate limiting, monitoring — every layer.
⑥ Audit everything — If you can't replay it, you can't investigate it.
⑦ Red team before you ship — Find your own injections before attackers do.
⑧ Build for restraint — An agent that does less and asks more is safer than one that acts boldly.

Agent Safety& Trust

Why Agent Safety is Different

The Threat Model

Three Kinds of Agent Failure

Prompt Injection — The #1 Attack

Input Guardrails

Principle of Least Privilege

Output Guardrails

Irreversibility Gating

The Minimal Footprint Principle

Human-in-the-Loop Patterns

Monitoring & Alerting in Production

Audit Trails

Organizational Safety Practices

Alignment by Design

8 Safety Rules

Agent Safety
& Trust