Agent memory and persistence
Chapter 1 made one fact unavoidable: the model is a pure function of its context, and between calls it remembers nothing. The "memory" a chat assistant seems to have is your application re-sending the conversation each turn. That works until the conversation outgrows the window, or until the user comes back next week and expects you to still know their name. At that point re-sending the whole history is either impossible (it does not fit) or wasteful (you pay for thousands of tokens to answer a question that needed three of them).
This chapter is about the durable answer: store the facts outside the window, and on each turn re-inject only the slice that this turn actually needs. That is the memory family from Chapter 1's table, and it relieves the capacity pressure by keeping the context lean no matter how long the relationship runs.
Memory is not the window
The word "memory" gets used for two different things, and conflating them is the usual source of confusion.
The context window is the model's working memory for one call: a fixed-size buffer, rebuilt every turn, that the model reads and then forgets. An external memory store is your durable record: a database, file, or vector index that lives in your code and survives across calls and sessions. The model never sees the store directly. Your code reads from the store, selects what is relevant, and writes it into the window. Retrieval (pulling a few facts out of the store) is the opposite of "send the whole history" (dumping everything back into the window). The whole skill is keeping the store complete and the window small.
So a memory system has to answer four operational questions, and the rest of this chapter is those four:
- Extraction. When the user says something worth keeping, what exactly do you store?
- Storage and embedding. In what form, so you can find it again later?
- Retrieval. Given a new question, which stored facts go into the window this turn?
- Invalidation. When a new fact contradicts an old one, how do you avoid keeping both?
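Before the full implementation, the split between store and window can be sketched in a few lines. This is only an illustration under simple assumptions: the names STORE, select, and build_window are hypothetical, and relevance is a naive shared-word test standing in for real retrieval.

```python
# Hypothetical names: STORE, select, and build_window are illustrative only.
STORE = [  # durable: lives in your code and survives across calls
    "prefers window seats",
    "is vegetarian",
    "budget is 2000 dollars",
]


def select(store, question):
    """Naive relevance: keep facts that share at least one word with the question."""
    q_words = set(question.lower().rstrip("?.!").split())
    return [fact for fact in store if q_words & set(fact.split())]


def build_window(store, question):
    """Rebuild the window from scratch: selected facts first, then the question."""
    lines = ["Known facts:"]
    lines += [f"- {fact}" for fact in select(store, question)]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


window = build_window(STORE, "Remind me of the budget")
print(window)
```

The store keeps all three facts; the window built for this one question carries only the budget line.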
A memory store in 150 lines
The code below builds the smallest honest version of all four. It uses NumPy and the standard library only: no model call, no database, no network. Read it for the mechanics, not the scale; a production system swaps each piece for a stronger one, but the shape is the same.
Three terms it leans on, defined before you meet them:
- An embedding is a list of numbers that stands in for a piece of text, arranged so that texts about the same thing land near each other in number-space. Real embeddings come from a trained network; ours is a cheap stand-in that hashes each word into one of 256 slots and counts it, so two facts that share words get overlapping vectors.
- Cosine similarity measures how aligned two vectors are: 1.0 means they point the same direction (very similar), 0.0 means they are unrelated. We unit-normalize every vector (scale it to length 1), which makes the cosine just the dot product, $\cos(a,b)=a\cdot b$.
- Top-k retrieval means: score every stored fact against the query, sort, and keep the best $k$. Only those $k$ enter the window, not the whole store.
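The last two terms can be checked by hand on toy vectors. The sketch below uses made-up 3-slot vectors and plain Python lists rather than NumPy; the helper names unit, dot, and top_k are illustrative, not taken from the chapter's file.

```python
import math


def unit(vec):
    """Scale a vector to length 1 (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    """Dot product; for unit vectors this is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, items, k):
    """Score every (label, vector) pair against the query, sort, keep the best k."""
    q = unit(query)
    scored = [(label, dot(q, unit(vec))) for label, vec in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy count vectors, invented for illustration.
items = [
    ("same direction", [2.0, 0.0, 0.0]),
    ("partly aligned", [1.0, 1.0, 0.0]),
    ("unrelated", [0.0, 0.0, 3.0]),
]
results = top_k([1.0, 0.0, 0.0], items, k=2)
for label, score in results:
    print(f"{score:.3f}  {label}")
# prints:
# 1.000  same direction
# 0.707  partly aligned
```

Note that the vector [2, 0, 0] scores 1.0 against [1, 0, 0] even though it is twice as long: after unit-normalization only direction matters.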
"""A tiny agent-memory store: extract, embed, store, retrieve, invalidate.

The model is stateless between calls. Anything it should "remember" across
turns or sessions is state YOUR code keeps outside the context window and
re-injects on demand. This file builds the smallest honest version of that:

  EXTRACT     pull atomic facts from a user turn with simple rules.
  EMBED       turn a fact into a fixed-length vector with a hashing bag-of-words.
  STORE       keep (fact, vector) in a list.
  RETRIEVE    rank stored facts against a query by cosine similarity, top-k.
  INVALIDATE  when a new fact contradicts an old one on the same subject and
              attribute (e.g. "favorite color"), replace the old fact in place.

numpy + stdlib only. No network, no model call. ~1.3 tokens/word for sizing.
"""
import re

import numpy as np

DIM = 256  # length of every embedding vector

# --------------------------------------------------------------------------
# EXTRACT: user turn -> list of normalized atomic facts
# --------------------------------------------------------------------------
# We look for a few first-person and "X is Y" shapes. A real system would use
# the model itself for this; the rules here are deliberately small so you can
# see exactly what becomes a fact and what does not. Each value group stops at
# the first clause break (" and ", punctuation, or end of string) instead of
# swallowing the rest, so "A and B" yields two atomic facts, not one run-on.
_VALUE = r"([\w ]+?)(?: and |[.,!?]|$)"  # a value, ending at a clause boundary

_PATTERNS = [
    # "my favorite color is blue" -> "favorite color is blue"
    (re.compile(r"\bmy ([\w ]+?) is " + _VALUE, re.I), "{0} is {1}"),
    # "i prefer window seats" -> "prefers window seats"
    (re.compile(r"\bi prefer " + _VALUE, re.I), "prefers {0}"),
    # "i am vegetarian" / "i'm allergic to peanuts" -> "is ..."
    (re.compile(r"\bi am " + _VALUE, re.I), "is {0}"),
    (re.compile(r"\bi'm " + _VALUE, re.I), "is {0}"),
    # "i live in Lisbon" -> "lives in Lisbon"
    (re.compile(r"\bi live in " + _VALUE, re.I), "lives in {0}"),
    # "i work at Helios" -> "works at Helios"
    (re.compile(r"\bi work at " + _VALUE, re.I), "works at {0}"),
]


def extract_facts(turn):
    """Return a list of normalized fact strings found in one user turn."""
    facts = []
    for pattern, template in _PATTERNS:
        for match in pattern.finditer(turn):
            groups = [g.strip().lower() for g in match.groups()]
            fact = template.format(*groups)
            fact = re.sub(r"\s+", " ", fact).strip()
            if fact and fact not in facts:
                facts.append(fact)
    return facts

# --------------------------------------------------------------------------
# EMBED: fact string -> fixed-length unit vector (hashing bag-of-words)
# --------------------------------------------------------------------------
# An "embedding" is a list of numbers that stands in for a piece of text, built
# so that texts about the same thing land near each other. Real embeddings come
# from a trained network. Ours is a cheap stand-in: hash each word to one of DIM
# slots and count it. Shared words -> overlapping slots -> nearby vectors.
_WORD = re.compile(r"[a-z0-9]+")

# A handful of words carry no topic ("i", "is", "my", "the"): drop them so two
# facts are judged similar by their content words, not their grammar.
_STOP = {"i", "is", "am", "my", "the", "a", "an", "and", "to", "in", "of",
         "at", "for", "me", "you", "your", "what", "should"}


def _stem(word):
    """Crudely fold a trailing 's' so 'seat' and 'seats' hit the same slot."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def embed(text):
    """Hash the words of `text` into a DIM-length vector, then unit-normalize."""
    vec = np.zeros(DIM, dtype=np.float64)
    for word in _WORD.findall(text.lower()):
        if word in _STOP:
            continue
        # Python's hash is salted per process, so use a stable hash instead.
        slot = hash_stable(_stem(word)) % DIM
        vec[slot] += 1.0
    norm = np.linalg.norm(vec)
    if norm > 0:
        vec /= norm  # length 1, so the dot product below is a cosine directly
    return vec


def hash_stable(word):
    """A small deterministic string hash (FNV-1a), so runs are reproducible."""
    h = 2166136261
    for ch in word.encode("utf-8"):
        h = ((h ^ ch) * 16777619) & 0xFFFFFFFF
    return h


def cosine(a, b):
    """Cosine similarity of two vectors: 1.0 identical direction, 0.0 unrelated.

    Both vectors are already unit length, so this is just their dot product.
    """
    return float(np.dot(a, b))

# --------------------------------------------------------------------------
# The store: facts, their vectors, and the subject/attribute key for updates
# --------------------------------------------------------------------------
class MemoryStore:
    def __init__(self):
        self.facts = []    # list of fact strings
        self.vectors = []  # list of embeddings, aligned with self.facts
        self.keys = []     # subject/attribute key, aligned, for invalidation

    def _key(self, fact):
        """The part of a fact that identifies WHAT it is about, minus the value.

        "favorite color is blue" and "favorite color is green" share the key
        "favorite color is", so the second should replace the first, not sit
        beside it. We key on the text up to and including the last " is ".
        """
        marker = " is "
        if marker in fact:
            return fact[: fact.rindex(marker) + len(marker)]
        # "prefers window seats" -> key "prefers", value "window seats"
        parts = fact.split(" ", 1)
        return parts[0] if len(parts) > 1 else fact

    def add(self, fact):
        """Store a fact, REPLACING any existing fact with the same key."""
        key = self._key(fact)
        for i, existing_key in enumerate(self.keys):
            if existing_key == key:
                old = self.facts[i]
                self.facts[i] = fact
                self.vectors[i] = embed(fact)
                return ("updated", old)
        self.facts.append(fact)
        self.vectors.append(embed(fact))
        self.keys.append(key)
        return ("added", None)

    def retrieve(self, query, k=3):
        """Return the top-k (fact, score) most similar to `query`."""
        if not self.facts:
            return []
        q = embed(query)
        scored = [(self.facts[i], cosine(q, self.vectors[i]))
                  for i in range(len(self.facts))]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

# --------------------------------------------------------------------------
# Sizing helper: estimate tokens the way Chapter 2 does (words * 1.3)
# --------------------------------------------------------------------------
def est_tokens(text):
    return round(len(text.split()) * 1.3)

# --------------------------------------------------------------------------
# Demo
# --------------------------------------------------------------------------
def main():
    # ---- Session 1: the user tells us things across several turns ----
    session_1 = [
        "Hi! My name is Dana and I live in Lisbon.",
        "I prefer window seats and I am vegetarian.",
        "By the way my favorite color is blue.",
        "I work at Helios and my budget is 2000 dollars.",
    ]
    store = MemoryStore()
    print("=== Session 1: extract facts from each turn and store them ===")
    for turn in session_1:
        found = extract_facts(turn)
        for fact in found:
            status, old = store.add(fact)
            if status == "updated":
                print(f" turn: {turn!r}")
                print(f" UPDATED {old!r} -> {fact!r}")
            else:
                print(f" turn: {turn!r}")
                print(f" stored {fact!r}")
    print(f"\n memory now holds {len(store.facts)} facts:")
    for fact in store.facts:
        print(f" - {fact}")

    # ---- A new session. The model remembers NOTHING on its own. ----
    # The user asks something only answerable from stored memory.
    print("\n=== Session 2 (later): a question unanswerable without memory ===")
    question = "Which seat do I prefer?"
    print(f" user asks: {question!r}")

    # BEFORE memory: the only context is this one question. No stored facts.
    print("\n -- BEFORE: no memory injected --")
    print(" context contains only the question; the model has never")
    print(" seen 'Dana', 'window seats', or anything from session 1.")
    print(" Best it can do: ask the user to repeat their preference.")

    # AFTER memory: retrieve the few relevant facts and inject only those.
    print("\n -- AFTER: retrieve top-k relevant memories and inject them --")
    hits = store.retrieve(question, k=3)
    for fact, score in hits:
        print(f" score {score:.3f} {fact}")
    injected = "\n".join(f"- {fact}" for fact, _ in hits)
    print(" injected memory block:")
    for line in injected.splitlines():
        print(f" {line}")
    print(" With 'prefers window seats' in context, the model answers:")
    print(" 'I'll book you a window seat.'")

    # ---- The saving: inject everything vs inject only the relevant slice ----
    print("\n=== The saving: whole history vs top-k memories ===")
    full_history = " ".join(session_1)
    full_tok = est_tokens(full_history)
    topk_tok = est_tokens(injected)
    print(f" inject ALL session-1 history : {full_tok:4d} tokens")
    print(f" inject top-{len(hits)} memories : {topk_tok:4d} tokens")
    saved = full_tok - topk_tok
    pct = 100.0 * saved / full_tok
    print(f" saved : {saved:4d} tokens ({pct:.0f}% smaller)")
    print(" (and history grows every turn; the top-k slice does not.)")

    # ---- Contradiction: retrieval returns the UPDATED fact, not the stale one ----
    print("\n=== Contradiction handling: the update wins ===")
    color_q = "what is my favorite color?"
    before = store.retrieve(color_q, k=1)[0]
    print(f" current best answer for {color_q!r}: {before[0]!r}")
    print(" user now says: 'Actually my favorite color is green.'")
    for fact in extract_facts("Actually my favorite color is green."):
        status, old = store.add(fact)
        print(f" {status.upper()} {old!r} -> {fact!r}")
    after = store.retrieve(color_q, k=1)[0]
    print(f" best answer now: {after[0]!r}")
    print(f" memory still holds {len(store.facts)} facts (no duplicate color).")


if __name__ == "__main__":
    main()
Running it:
=== Session 1: extract facts from each turn and store them ===
turn: 'Hi! My name is Dana and I live in Lisbon.'
stored 'name is dana'
turn: 'Hi! My name is Dana and I live in Lisbon.'
stored 'lives in lisbon'
turn: 'I prefer window seats and I am vegetarian.'
stored 'prefers window seats'
turn: 'I prefer window seats and I am vegetarian.'
stored 'is vegetarian'
turn: 'By the way my favorite color is blue.'
stored 'favorite color is blue'
turn: 'I work at Helios and my budget is 2000 dollars.'
stored 'budget is 2000 dollars'
turn: 'I work at Helios and my budget is 2000 dollars.'
stored 'works at helios'
memory now holds 7 facts:
- name is dana
- lives in lisbon
- prefers window seats
- is vegetarian
- favorite color is blue
- budget is 2000 dollars
- works at helios
=== Session 2 (later): a question unanswerable without memory ===
user asks: 'Which seat do I prefer?'
-- BEFORE: no memory injected --
context contains only the question; the model has never
seen 'Dana', 'window seats', or anything from session 1.
Best it can do: ask the user to repeat their preference.
-- AFTER: retrieve top-k relevant memories and inject them --
score 0.577 prefers window seats
score 0.000 name is dana
score 0.000 lives in lisbon
injected memory block:
- prefers window seats
- name is dana
- lives in lisbon
With 'prefers window seats' in context, the model answers:
'I'll book you a window seat.'
=== The saving: whole history vs top-k memories ===
inject ALL session-1 history : 47 tokens
inject top-3 memories : 16 tokens
saved : 31 tokens (66% smaller)
(and history grows every turn; the top-k slice does not.)
=== Contradiction handling: the update wins ===
current best answer for 'what is my favorite color?': 'favorite color is blue'
user now says: 'Actually my favorite color is green.'
UPDATED 'favorite color is blue' -> 'favorite color is green'
best answer now: 'favorite color is green'
memory still holds 7 facts (no duplicate color).
Read the four sections in order, because each one is a piece of the answer.
Extraction turns chatty turns into atomic facts. The extractor here is a handful of
regular-expression patterns: my X is Y, I prefer Y, I am Y, I live in Y, I work at Y. The
turn "My name is Dana and I live in Lisbon" becomes two separate facts, name is dana and
lives in lisbon, not one run-on string, because a fact you can retrieve has to be about one
thing. A real system hands this job to the model itself, asking it to emit clean facts; the
rules above are just enough to make the step concrete.
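Retrieval is where the window stays small. In session 2 the user asks "Which seat do I
prefer?" After stop-word removal and a crude singular/plural fold (so seat and seats
collide, and so do prefer and prefers), the query and the stored fact "prefers window seats"
share two content words, seat and prefer. The query has four content words and the fact has
three, so the cosine is $2/(\sqrt{4}\cdot\sqrt{3}) \approx 0.577$, while every other fact
shares nothing and scores 0.000. The system injects the top 3 and the model can answer. Two
of those three are zero-score padding, which is what a fixed k with no score threshold
gives you. Note what did not happen: the budget, the employer, and the diet stayed in the
store, out of the window, because this question did not touch them.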
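The 0.577 score can be reproduced without the hashing step. The sketch below is a simplified variant, not the chapter's store: it counts stemmed content words directly (so there are no slot collisions to worry about) and adds a min_score cutoff, which is an assumption of this example, to drop facts that share nothing with the query.

```python
import math
import re
from collections import Counter

STOP = {"i", "is", "am", "my", "the", "a", "an", "and", "to", "in", "of",
        "at", "for", "me", "you", "your", "what", "should"}


def stem(word):
    """Fold a trailing 's', mirroring the chapter's crude stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def content_words(text):
    """Count stemmed words after dropping stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(stem(w) for w in words if w not in STOP)


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    shared = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return shared / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, facts, k=3, min_score=0.1):
    """Top-k by cosine, then drop anything below min_score (hypothetical cutoff)."""
    q = content_words(query)
    scored = [(fact, cosine(q, content_words(fact))) for fact in facts]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(fact, score) for fact, score in scored[:k] if score >= min_score]


facts = ["name is dana", "lives in lisbon", "prefers window seats",
         "is vegetarian", "favorite color is blue",
         "budget is 2000 dollars", "works at helios"]
hits = retrieve("Which seat do I prefer?", facts)
for fact, score in hits:
    print(f"{score:.3f} {fact}")
# prints: 0.577 prefers window seats
```

With the cutoff in place only one fact is injected instead of three. Whether a cutoff is worth having, and where to set it, depends on the embedding; with real embeddings unrelated texts rarely score exactly zero.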
The saving is the number that justifies the whole apparatus. Dumping all of session 1 into the window costs 47 tokens here; injecting the 3 retrieved facts costs 16, about two-thirds smaller. On a four-turn toy that is a rounding error. The point is the trend: the full history grows every single turn, so by turn fifty it is hundreds of tokens of mostly irrelevant chatter, while the top-k slice stays roughly constant because it is always "the few facts this question needs." Memory turns a cost that scales with conversation length into one that scales with question complexity.
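The trend is easy to project. The sketch below assumes a flat 12 tokens per turn (roughly the demo's 47 tokens over four turns) and holds the retrieved slice at the demo's 16 tokens; both constants are illustrative assumptions, not measurements of a longer conversation.

```python
TOKENS_PER_TURN = 12  # assumed average, close to the demo's 47 tokens / 4 turns
TOP_K_TOKENS = 16     # size of the injected slice in the demo


def full_history_tokens(turns):
    """Cost of re-sending every past turn: grows linearly with the conversation."""
    return turns * TOKENS_PER_TURN


for turns in (4, 50, 500):
    full = full_history_tokens(turns)
    ratio = full / TOP_K_TOKENS
    print(f"turn {turns:3d}: full history ~{full} tokens, "
          f"top-k ~{TOP_K_TOKENS} tokens ({ratio:.0f}x)")
```

Under these assumptions the full history reaches about 600 tokens by turn fifty and about 6000 by turn five hundred, while the slice stays at 16.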
Invalidation is the part people forget, and forgetting it is how assistants end up
believing two contradictory things. The user said blue, then later said green. A naive store
appends both and retrieval starts returning whichever happens to rank higher, which is a coin
flip. The fix is to give each fact a key: the part that says what the fact is about,
minus the value. Both color facts share the key favorite color is, so the second one
replaces the first in place. After the update the store still holds 7 facts, not 8, and a
query for the favorite color returns green. The key is what lets you tell "a new fact about
the same thing" (update) apart from "a fact about a different thing" (add).
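The difference between appending and keyed replacement fits in a few lines. This is a minimal sketch that reuses the chapter's keying rule in a standalone function; fact_key, naive, and keyed are names chosen for this example.

```python
def fact_key(fact):
    """Subject/attribute part of a fact: text up to and including the last ' is '.

    Facts without ' is ' fall back to their first word, as in the chapter's store.
    """
    marker = " is "
    if marker in fact:
        return fact[: fact.rindex(marker) + len(marker)]
    return fact.split(" ", 1)[0]


naive = []  # append-only: contradictions pile up side by side
keyed = {}  # key -> current fact: a new value overwrites the old one

for fact in ["favorite color is blue",
             "budget is 2000 dollars",
             "favorite color is green"]:
    naive.append(fact)
    keyed[fact_key(fact)] = fact

print(len(naive), "facts in the naive store")  # 3: blue and green both present
print(len(keyed), "facts in the keyed store")  # 2: green replaced blue
print(keyed["favorite color is "])             # favorite color is green
```

One caveat of this particular keying rule: facts that do not contain " is " are keyed on their first word alone, so "is vegetarian" and "is allergic to peanuts" both get the key "is" and the second would overwrite the first. A real system needs a richer key, which is one reason production systems ask the model to decide between add and update.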
What the real systems do
The toy above is the shape; three open projects are the production versions, and they make different bets about how memory should be structured.
- Mem0 is the closest to what we built: a vector-first memory layer you bolt onto any agent. It extracts atomic facts from the conversation with an LLM, embeds them, stores them in a vector database, and retrieves the relevant ones with a multi-signal ranking (not just one cosine score). It also does the invalidation step, deciding whether a new fact adds to, updates, or contradicts what is stored. Its pitch is that this beats stuffing the full history into the window, and it reports strong numbers on the LongMemEval long-conversation benchmark to back that up.
- Letta (the project formerly published as MemGPT) takes an operating-system view. It gives the agent a memory hierarchy: a small core tier always in the window (the agent's persona and the most important facts), a recall tier for recent conversation, and a large archival tier in external storage. The twist is that the agent pages its own memory: it has tools to move information between tiers, reading an archival fact into the window when it needs it and writing one back out when it does not, exactly the way an OS pages memory between RAM and disk. The model manages its own context instead of your code doing it from outside.
- Zep, built on the Graphiti engine, stores memory as a temporal knowledge graph: facts are nodes and edges, and each one carries a validity window saying when it was true. That extra dimension is the whole point. A vector store answers "what is true now"; a temporal graph also answers "what was true in Q1," because a superseded fact is not deleted, it is marked as valid until the moment it was replaced. When the history of a fact matters, that wins. It is the subject of Chapter 10.
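The validity-window idea in the last item can be sketched with a flat list; no graph is needed to see the mechanism. This is not Zep's or Graphiti's API: TimedFact, TemporalStore, and the integer timestamps are all hypothetical, chosen to show only that a superseded fact is closed rather than deleted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimedFact:
    key: str
    value: str
    valid_from: int                 # hypothetical integer timestamp (e.g. a day number)
    valid_to: Optional[int] = None  # None means "still true"


class TemporalStore:
    def __init__(self):
        self.facts = []

    def add(self, key, value, now):
        """Close the currently valid fact for `key` (if any), then record the new one."""
        for fact in self.facts:
            if fact.key == key and fact.valid_to is None:
                fact.valid_to = now
        self.facts.append(TimedFact(key, value, valid_from=now))

    def value_at(self, key, when):
        """Return the value valid for `key` at time `when`, or None if there was none."""
        for fact in self.facts:
            still_open = fact.valid_to is None or when < fact.valid_to
            if fact.key == key and fact.valid_from <= when and still_open:
                return fact.value
        return None


store = TemporalStore()
store.add("budget", "2000 dollars", now=10)
store.add("budget", "3500 dollars", now=100)
print(store.value_at("budget", 50))   # 2000 dollars
print(store.value_at("budget", 150))  # 3500 dollars
print(len(store.facts))               # 2: the old budget is kept, not overwritten
```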
Where do you reach for this? Three patterns recur. Cross-session personalization: the assistant remembers a user's preferences, tone, and past decisions from one visit to the next. Long-running assistants: an agent on a task for hours keeps durable notes outside the window so it does not lose the thread when old turns get evicted. Customer histories: a support agent recalls a specific account's prior tickets and context without re-reading the whole record every message.
With Claude
You can implement memory entirely on your side, the way the demo does: keep the store, run retrieval, and paste the selected facts into the prompt. The Anthropic SDK also offers a memory tool that lets the model drive instead. You declare it on the request,
# Illustrative: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-opus-4-8",
max_tokens=1024,
tools=[{"type": "memory_20250818", "name": "memory"}],
messages=[{"role": "user", "content": "Remember that I prefer window seats."}],
)
and the model can then read and write files in a memory directory across calls. You supply
the storage backend (the tool tells you which file to read, create, or edit; where those
bytes live is your code), which keeps you in control of where durable user data is kept. This
is the Letta idea in miniature: the model manages its own notes rather than your retrieval
code deciding for it. Check the current Anthropic API documentation before wiring it up; the
tool type string and the backend command interface have to match exactly.
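To make "you supply the storage backend" concrete, here is a sketch of what the client side might look like. Treat every detail as an assumption to verify against the current documentation: the command names (view, create, str_replace), the input field names (path, file_text, old_str, new_str), the /memories root, and the returned strings are illustrative, and DictMemoryBackend is a hypothetical class that keeps files in a dict rather than on disk.

```python
import posixpath


class DictMemoryBackend:
    """Hypothetical backend: executes memory commands against an in-memory dict."""

    ROOT = "/memories"  # assumed root directory for all memory files

    def __init__(self):
        self.files = {}  # path -> file text

    def _safe(self, path):
        """Normalize the path and refuse anything that escapes ROOT."""
        clean = posixpath.normpath(path)
        if clean != self.ROOT and not clean.startswith(self.ROOT + "/"):
            raise ValueError(f"path escapes {self.ROOT}: {path!r}")
        return clean

    def handle(self, tool_input):
        """Dispatch one command dict and return a text result for the model."""
        command = tool_input["command"]
        path = self._safe(tool_input["path"])
        if command == "view":
            if path in self.files:
                return self.files[path]
            listing = sorted(p for p in self.files if p.startswith(path + "/"))
            return "\n".join(listing) if listing else "(empty)"
        if command == "create":
            self.files[path] = tool_input["file_text"]
            return f"created {path}"
        if command == "str_replace":
            if path not in self.files or tool_input["old_str"] not in self.files[path]:
                raise ValueError(f"nothing to replace in {path}")
            self.files[path] = self.files[path].replace(
                tool_input["old_str"], tool_input["new_str"], 1)
            return f"edited {path}"
        raise ValueError(f"unsupported command: {command!r}")


backend = DictMemoryBackend()
backend.handle({"command": "create", "path": "/memories/prefs.txt",
                "file_text": "seat: window\n"})
print(backend.handle({"command": "view", "path": "/memories"}))
```

The path check is the part worth keeping whatever the storage is: the model chooses the paths, so the backend has to confine them to the memory directory.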
Using the real tool: commands and before/after proof
The 150-line store above shows the mechanics. Here is how you reach for the production version. Mem0 is the closest match to what we built, so it is the one to run first.
It is a Python package called mem0ai (the import name is mem0). Install it, create a
Memory, add a couple of facts under a user_id, then search to get back only the relevant
ones. This is follow-along: the library and an embedding key are not on this box, so the
output is labeled as expected, not measured here.
pip install mem0ai
# Follow-along: requires mem0ai and an embedding/LLM key in the environment.
from mem0 import Memory
m = Memory() # default config: an LLM extracts facts, embeds them, stores them in a vector DB
# Session 1: hand Mem0 raw conversation turns. It extracts the atomic facts itself.
m.add(
[
{"role": "user", "content": "Hi, I'm Dana and I live in Lisbon."},
{"role": "user", "content": "I prefer window seats and I am vegetarian."},
],
user_id="dana",
)
# Session 2 (a later call, even a later day): retrieve only what this question needs.
hits = m.search(query="Which seat do I prefer?", filters={"user_id": "dana"}, top_k=3)
for h in hits["results"]:
print(round(h["score"], 3), h["memory"])
0.41 Prefers window seats
0.22 Lives in Lisbon
0.19 Name is Dana
That is the same four-operation shape as the demo, with each piece swapped for a stronger
one. add() does the extraction: you pass whole turns and Mem0's LLM splits them into atomic
facts (Prefers window seats, not the run-on sentence). It embeds and stores them in a vector
database keyed by user_id, and it runs the invalidation step, deciding whether a new fact
adds to or replaces an old one. search() does retrieval: it scores the stored facts against
the query and returns the top few. The filters={"user_id": "dana"} argument is the scoping
key, so one store can hold many users without their facts leaking into each other.
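The scoping idea is independent of Mem0 and can be shown with a dict of lists. ScopedMemory below is a hypothetical stand-in, not Mem0's implementation, and it ranks by shared-word count instead of embeddings; the point is only that a search never scans another user's facts.

```python
from collections import defaultdict


class ScopedMemory:
    """Hypothetical multi-user store: every fact is filed under a user_id."""

    def __init__(self):
        self._facts = defaultdict(list)  # user_id -> list of fact strings

    def add(self, fact, user_id):
        self._facts[user_id].append(fact)

    def search(self, query, user_id, top_k=3):
        """Rank only this user's facts by shared-word count; drop zero-overlap facts."""
        q_words = set(query.lower().rstrip("?").split())
        scored = [(len(q_words & set(fact.lower().split())), fact)
                  for fact in self._facts.get(user_id, [])]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for score, fact in scored[:top_k] if score > 0]


mem = ScopedMemory()
mem.add("prefers window seats", user_id="dana")
mem.add("prefers aisle seats", user_id="omar")
print(mem.search("Which seats do I prefer?", user_id="dana"))
# prints: ['prefers window seats']
```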
Before and after: the metric is injected-context tokens
The claim worth proving is that retrieval keeps the window small. The metric is the number of injected-context tokens: how many tokens of prior knowledge you paste into the prompt to answer one cross-session question. Count it the way you would in production, with Anthropic's token-counting endpoint, which returns the input-token count for a set of messages (Anthropic documents the figure as an estimate that can differ slightly from the billed count).
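# Follow-along: requires the anthropic SDK and an API key.
# Assumes `prior_turns` (a list of past message strings) is already defined
# and `hits` is the result of the Mem0 search above.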
import anthropic
client = anthropic.Anthropic()
question = "Which seat do I prefer?"
# (a) Dump the WHOLE prior history into the prompt.
full_history = "\n".join(prior_turns) # every message from every past session
before = client.messages.count_tokens(
model="claude-opus-4-8",
messages=[{"role": "user", "content": full_history + "\n\n" + question}],
).input_tokens
# (b) Inject only the top-k facts Mem0.search returned.
memories = "\n".join(h["memory"] for h in hits["results"])
after = client.messages.count_tokens(
model="claude-opus-4-8",
messages=[{"role": "user", "content": memories + "\n\n" + question}],
).input_tokens
print(before, "->", after)
3000 -> 60
Those two numbers are illustrative (expected for a relationship with a few dozen past turns, not measured on this box), but the gap is the whole point. Dumping the full history costs roughly 3000 tokens and climbs every session; injecting the three facts Mem0 retrieved costs about 60 and stays flat, because it is always "the few facts this question needs."
The correctness half of the proof is sharper than the token count. Without memory, the
cross-session question is simply unanswerable: a fresh call has never seen "Dana" or "window
seats," so the best the model can do is ask the user to repeat themselves. With memory, the
one fact that matters (Prefers window seats) is in the window, and the model answers
"I'll book you a window seat." You can run the same before/after on this box right now: the
from-scratch store demo above prints exactly that flip (score 0.577 for prefers window
seats, then the booked-seat answer). That run is the measured counterpart to the
illustrative Mem0 numbers.
The other two libraries install the same way. Letta (the MemGPT project) ships the server as
pip install letta and the client SDK as pip install letta-client; you talk to a running
agent that pages its own memory tiers. Zep is pip install zep-cloud, a hosted temporal graph
you write conversations to and query for the facts that were true at a given time.
In Claude Code / Anthropic
Two more pieces sit on the Anthropic side. The first is the memory tool already shown under With Claude: declared on the request as tools=[{"type": "memory_20250818", "name": "memory"}], it lets the model issue read and write commands against a memory directory that you back with storage (the tool says which file to view, create, or edit; where those bytes live is your code). That is the Letta idea in miniature, the model managing its own notes instead of your retrieval code doing the reading and writing.
A project CLAUDE.md is the other kind: a hand-maintained, long-lived memory. It is not
retrieved per question; it is re-injected in full at the start of every session, so it is the
place for the stable facts about a project that every turn should know. Retrieval keeps the
per-question slice small; a CLAUDE.md keeps the always-on baseline correct.
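The two kinds of memory end up in the same prompt, and the assembly can be sketched generically. This is not how Claude Code builds its context internally: build_context, the bracketed section labels, and the sample baseline text are all made up for illustration.

```python
def build_context(baseline, memories, question):
    """Combine the always-on baseline with the per-question retrieved slice."""
    sections = ["[baseline]", baseline.strip()]
    if memories:  # the retrieved slice is optional and changes every question
        sections.append("[retrieved]")
        sections.extend(f"- {memory}" for memory in memories)
    sections += ["[question]", question]
    return "\n".join(sections)


# Made-up baseline standing in for the contents of a project CLAUDE.md.
baseline = "Project uses Python 3.12.\nRun tests with pytest.\n"
context = build_context(baseline, ["prefers window seats"], "Which seat do I prefer?")
print(context)
```

The baseline section is identical on every call, while the retrieved section is rebuilt per question, which mirrors the division of labor described above.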
Takeaways
- The model is stateless between calls, so durable memory is something your code stores outside the window and re-injects a relevant slice of. The store is not the window.
- A memory system is four operations: extract atomic facts from a turn, embed and store them, retrieve the top-k relevant ones for a new query, and invalidate the stale fact when a new one contradicts it.
- Retrieval is what keeps the window small. Injecting the few relevant facts (16 tokens in the demo) instead of the whole history (47, and climbing every turn) is a real and growing saving, because top-k scales with the question, not the conversation length.
- Invalidation needs a key (the subject and attribute of a fact, minus its value) so a new value replaces the old one in place instead of sitting beside it. Without it, retrieval returns contradictory facts at random.
- Mem0 is the vector-first layer you bolt on; Letta gives the agent an OS-style memory hierarchy it pages itself; Zep stores a temporal graph for when "what was true then" matters. Anthropic's memory tool lets the model read and write a directory you back with storage.
👉 The store we built throws away the past: update the favorite color and blue is gone. But plenty of questions are about the past, and answering "what was the budget last quarter" needs a memory that records when each fact was true. The next chapter builds exactly that: a temporal knowledge graph, where facts carry validity windows and nothing is ever silently overwritten.