How agents remember, persist, and reason across time — the architecture of agent cognition, from context windows to vector stores.
Every LLM call starts fresh. The model has no persistent memory by default — you rebuild its entire world from scratch on every invocation. The scaffolding you write around those calls is the agent's memory.
Most early agent bugs are actually memory bugs: wrong context, stale data, missing history. A user explains their preferences in turn 1, and by turn 8 the agent has forgotten them entirely.
Memory architecture has four concentric rings, each more persistent and more expensive than the one inside it. In-context is fastest but ephemeral — gone when the session ends. Semantic/vector stores enable fuzzy retrieval from large knowledge bases. Episodic captures specific past interactions. Procedural is knowledge baked into the system prompt itself — most durable, slowest to update.
in-context → semantic → episodic → procedural
Memory is architecture, not an afterthought. Design your memory stack before writing instructions. Ask three questions: What does the agent need to remember within a session? Across sessions? Across users?
The answers map directly to which rings you need. Most agents only need in-context plus one outer ring. Don't over-engineer: a well-managed context window solves more than most teams expect.
The context window is your most valuable real estate. Key decisions: how much conversation history to keep, what system prompt content to include, how to structure injected data. XML beats prose for structured state — the model reads it more reliably.
The window fills fast. Without deliberate management, whatever is oldest gets evicted first, and that is often exactly the context that matters most. Design eviction policies before you hit the limit, not after.
system prompt → history → retrieved docs → current turn
Four strategies for multi-turn history: Full (keep everything — simple, hits the limit fast), Windowed (keep last N turns — predictable, loses early context), Summarized (compress old turns into a rolling summary — requires an extra LLM call but preserves key facts), Semantic (retrieve only the most relevant past turns — complex but powerful).
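The windowed and summarized strategies can be sketched together in a few lines. This is a minimal illustration, not a production implementation: `summarize` here is a placeholder that joins text, standing in for the extra LLM call the summarized strategy requires.

```python
from dataclasses import dataclass, field

def summarize(turns):
    # Placeholder for an LLM call that compresses old turns into prose.
    return "Earlier: " + "; ".join(t["content"] for t in turns)

@dataclass
class RollingHistory:
    max_turns: int = 4
    summary: str = ""
    turns: list = field(default_factory=list)

    def add(self, role, content):
        self.turns.append({"role": role, "content": content})
        if len(self.turns) > self.max_turns:
            # Evict the oldest turns into the rolling summary (summarized
            # strategy); drop them entirely and you have the windowed one.
            evicted = self.turns[: -self.max_turns]
            if self.summary:
                evicted = [{"role": "system", "content": self.summary}] + evicted
            self.summary = summarize(evicted)
            self.turns = self.turns[-self.max_turns:]

    def render(self):
        # Rolling summary rides along as a system message, then recent turns.
        prefix = [{"role": "system", "content": self.summary}] if self.summary else []
        return prefix + self.turns
```

The eviction happens at write time, so the window never silently overflows between turns.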
For agents that maintain complex state — task lists, user preferences, workflow progress — embed structured data directly in context using XML or JSON. The model reads it faithfully and updates it predictably.
Treat in-context structured state like a document the model is editing collaboratively with you. Define a schema, inject it on every turn, parse it back out. Pitfall: state that grows unbounded eventually overflows. Design eviction policies from the start.
<state> block in system prompt → model reads + updates → you parse response → persist externally → re-inject next turn
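The inject → parse → re-inject loop can be sketched as two small functions. A rough sketch, assuming JSON inside an XML `<state>` wrapper; the model call itself is elided, and falling back to the last good state guards against the model dropping or mangling the tags.

```python
import json
import re

STATE_RE = re.compile(r"<state>(.*?)</state>", re.DOTALL)

def inject(system_prompt, state):
    # Re-inject the current state block on every turn.
    return f"{system_prompt}\n<state>{json.dumps(state)}</state>"

def parse(model_output, previous_state):
    # Parse the updated block back out of the model's response;
    # fall back to the last good state if the tags are missing or broken.
    match = STATE_RE.search(model_output)
    if not match:
        return previous_state
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return previous_state
```

Whatever `parse` returns is what you persist externally and hand back to `inject` next turn.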
When your agent needs to know things that don't fit in the context window — internal docs, past conversations, knowledge bases — add a vector store. The pattern: embed the user's message, retrieve the top-K most semantically similar chunks, inject them into context.
Three quality levers: embedding model choice, chunk size, and how many chunks to retrieve. Common mistake: chunking too coarsely (whole documents) or too finely (individual sentences). Chunking at 200–400 tokens with overlap usually wins.
embed → similarity search → top-K chunks → inject
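The retrieval shape is easy to see with toy vectors. In this sketch `embed` is a bag-of-words stand-in for a real embedding model, but the embed → score → top-K flow is the same one a vector store runs.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, chunks, k=2):
    # Embed the query, score every chunk, keep the k most similar.
    q = embed(query)
    scored = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return scored[:k]
```

In production the chunks are embedded once at index time, not per query; only the query embedding happens online.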
Retrieval quality determines answer quality. Three techniques that compound: Query rewriting (reformulate the user's question for retrieval before embedding), HyDE (generate a hypothetical ideal answer, embed that to retrieve), Reranking (after top-K, use a cross-encoder to reorder by relevance).
Each step costs a bit more latency. Each one meaningfully improves precision. Stack them incrementally — start with rewriting, add reranking when precision matters most.
raw query → rewritten → HyDE → reranked
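The three stages stack as plain function composition. Each function below is a deliberately naive stub: in practice `rewrite` and `hyde` are LLM calls and `rerank` is a cross-encoder, but the stubs show where each one sits in the pipeline.

```python
def rewrite(query):
    # Stub: in practice an LLM reformulates the question for retrieval.
    return query.lower().replace("how do i", "steps to")

def hyde(query):
    # Stub: in practice an LLM drafts a hypothetical ideal answer,
    # and you embed that answer instead of the raw query.
    return f"To {query}, you would typically ..."

def rerank(query, candidates):
    # Stub: in practice a cross-encoder scores (query, chunk) pairs jointly.
    terms = set(query.lower().split())
    return sorted(candidates,
                  key=lambda c: len(terms & set(c.lower().split())),
                  reverse=True)
```

A stacked call looks like `rerank(query, retrieve(hyde(rewrite(query))))`, where `retrieve` is your top-K search; drop stages from the inside out when latency matters more than precision.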
Episodic memory stores specific past interactions and retrieves them when relevant. "User said they prefer concise responses." "This customer's previous complaint was about billing." Retrieved and injected into the next session's context.
Implementation: at session end, extract key facts with an LLM → embed and store → at next session start, retrieve top relevant facts → inject. TTL everything — a memory from 6 months ago may no longer reflect reality.
extract facts → embed + store → retrieve → inject
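The store-with-TTL half of that loop can be sketched directly. The fact-extraction LLM call and the embedding step are elided here; days are passed explicitly so expiry is easy to follow.

```python
class EpisodicStore:
    """Facts carry timestamps and expire via TTL (measured in days here)."""

    def __init__(self, ttl_days):
        self.ttl_days = ttl_days
        self.facts = []  # (day_stored, user_id, fact)

    def store(self, user_id, fact, day):
        self.facts.append((day, user_id, fact))

    def recall(self, user_id, today):
        # TTL everything: a fact older than the window may no longer
        # reflect reality, so it is never injected.
        return [fact for day, uid, fact in self.facts
                if uid == user_id and today - day <= self.ttl_days]
```

Note the per-user filter inside `recall`: episodic memory is keyed by user from the start, which matters again in the privacy section.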
Procedural memory is knowledge encoded in the system prompt itself. When the agent repeatedly makes the same mistake, the fix isn't episodic — it's procedural: update the system prompt to embed the corrected behavior permanently.
This is the most durable form of memory and the cheapest to retrieve (zero cost — it's already in context). Pattern: monitor evals for systematic failure modes → diagnose root cause → encode the fix as a rule → re-run evals. Agent learning without fine-tuning.
failure trace → diagnose → encode rule → eval improvement
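The encode-the-fix step reduces to appending diagnosed rules to the system prompt. A sketch, with illustrative eval and rule names; the monitoring and diagnosis that produce the `fixes` mapping are the hard part and happen outside this function.

```python
def encode_fixes(system_prompt, failed_evals, fixes):
    """Append a corrective rule for each failing eval, skipping duplicates.

    failed_evals: names of evals that exposed a systematic failure.
    fixes: eval name -> rule text (the diagnosed root-cause fix).
    """
    for name in failed_evals:
        rule = fixes.get(name)
        if rule and rule not in system_prompt:
            system_prompt += f"\n- {rule}"
    return system_prompt
```

The duplicate check keeps the loop idempotent, so re-running it after every eval pass doesn't bloat the prompt.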
When multiple agents collaborate, they need shared state. Two patterns: Blackboard (shared read-write store all agents can access — simple, requires locking) and Message passing (agents communicate via structured messages — more complex, naturally ordered).
Key risks: race conditions (two agents write conflicting state simultaneously), staleness (agent reads superseded state), visibility (agents don't know what others have done). Treat shared memory like a database: define schemas, use locking, log all writes.
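A minimal blackboard illustrating the locking and write-logging advice above, using an in-process lock for simplicity; a real multi-agent system would need the same discipline over a shared database.

```python
import threading

class Blackboard:
    """Shared read-write store for collaborating agents, with locking
    and a write log so conflicts and staleness can be diagnosed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}
        self.write_log = []  # (agent, key, value): log all writes

    def update(self, agent, key, fn):
        # Read-modify-write under one lock, so two agents can't
        # interleave and clobber each other's updates.
        with self._lock:
            value = fn(self._state.get(key))
            self._state[key] = value
            self.write_log.append((agent, key, value))
            return value

    def read(self, key):
        with self._lock:
            return self._state.get(key)
```

Passing an update function into `update`, rather than reading then writing in two calls, is what closes the race-condition window.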
Memories go stale. A user's preferences change. A document gets updated. An episodic memory from three months ago contradicts the current system state. Stale memories cause confident incorrect behavior — often worse than no memory at all.
Strategies: TTL-based expiry, version-based invalidation (memories linked to document versions), confidence decay (memories become less trusted over time), user-facing controls (let users view and delete).
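Confidence decay is the least obvious of the four strategies, so here is one common shape for it: exponential decay with a half-life, where memories below a trust threshold are simply not injected. The half-life and threshold values are illustrative.

```python
def confidence(age_days, half_life_days=30.0):
    # A memory's trust halves every `half_life_days`.
    return 0.5 ** (age_days / half_life_days)

def fresh_memories(memories, today, threshold=0.25):
    # memories: list of (day_stored, text); only sufficiently
    # trusted memories are injected back into context.
    return [text for day, text in memories
            if confidence(today - day) >= threshold]
```

Unlike a hard TTL, decay degrades gracefully: a borderline memory can still be surfaced with a lower-priority framing rather than dropped outright.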
Decision guide — which ring to use for which problem: In-context for current session, structured state, recent history. Vector store for large knowledge bases, past conversations needing semantic search. Episodic for user preferences, project context, customer history. Procedural for systematic agent behaviors and learned rules.
Start at the center, move outward only when you hit a real limit. The most common over-engineering mistake: jumping to vector stores before exhausting what a well-managed context window can do.
Memory creates privacy obligations. You're storing user behavior, preferences, and content — often across sessions. GDPR and similar regulations give users rights to access, correct, and delete their data. Design for deletion from day one.
Never store sensitive information (health data, financial data, passwords) in agent memory without explicit consent. Separate memories by user ID. Keep audit logs of what was stored and when. Make per-user purge a first-class operation, not a hotfix.
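Per-user isolation, audit logging, and first-class purge fit in one small store. A sketch with illustrative names; a real system would back this with a database and record timestamps in the audit log.

```python
class MemoryStore:
    def __init__(self):
        self._by_user = {}   # user_id -> list of memories: per-user isolation
        self.audit = []      # (action, user_id): what was stored or purged

    def remember(self, user_id, memory):
        self._by_user.setdefault(user_id, []).append(memory)
        self.audit.append(("write", user_id))

    def recall(self, user_id):
        # Memories are keyed by user; nothing leaks across users.
        return list(self._by_user.get(user_id, []))

    def purge(self, user_id):
        # Right-to-delete as a first-class operation: everything for
        # this user goes in one call, and the purge itself is audited.
        removed = len(self._by_user.pop(user_id, []))
        self.audit.append(("purge", user_id))
        return removed
```

Because `purge` is part of the store's interface from day one, deletion requests never require touching storage internals.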
Eight rules that memory-mature teams follow. Violate them and you'll rediscover why they exist.
① Start in-context — Don't add a vector store until you've hit the window limit.
② Design eviction policies first — Unbounded memory always overflows.
③ TTL everything — Stale memories are worse than no memories.
④ Give users control — View and delete is a trust feature.
⑤ Separate by user — Never let one user's memory leak to another's context.
⑥ Procedural beats episodic — Systematic failures go in the prompt, not a memory store.
⑦ Log every write — You need an audit trail when memory misbehaves.
⑧ Test memory paths — Write evals that specifically test retrieval and injection.