Failure and procedural learning

Chapter 9 gave the agent a memory for facts: the user lives in Lisbon, prefers window seats. That memory makes the agent better informed. It does not make the agent better behaved. An agent can know every fact about your project and still commit without being asked, ship a chapter with a broken link, or call an LLM with the wrong model id, again and again, because nothing it knows changes what it does.

This chapter is about the other kind of memory: rules of conduct. We will mine an agent's past sessions for the mistakes it keeps repeating, turn those mistakes into instructions, and write the instructions back into the agent's system prompt so the same mistake does not happen a fourth time. The jargon for this is procedural memory, and learning it from observed failures is procedural learning.

Two kinds of memory

A memory system can store two very different things, and the distinction is the whole point of this chapter.

Semantic memory is facts the agent knows. "The user's budget is 2000 dollars." "This repo deploys to Cloudflare Pages." You retrieve a fact and put it in the context so the model can use it as information. That was Chapter 9.
Procedural memory is how-to rules that change the agent's behavior. "Run the tests before reporting done." "Never commit unless asked." You do not retrieve these per question; they live in the system prompt and apply to every call, steering what the agent does rather than telling it a fact.

There is a third category you will see named in the libraries, episodic memory, which stores whole past episodes (a full transcript of a previous session) to recall as a worked example. We touched it in passing with few-shot retrieval; this chapter is specifically about procedural memory, because that is the one you learn from failure and write into instructions.

Don't be confused. Semantic memory and procedural memory are not two flavors of the same thing. Semantic memory answers "what does the agent know?" and is retrieved into the context as data: a fact sits there inertly until a question needs it. Procedural memory answers "how does the agent act?" and is injected as instruction: a rule changes behavior on every call whether or not anyone asks about it. Storing "never commit without asking" as a retrievable fact would be a mistake, because the agent would only see it when something reminded it to look, which is exactly when it is too late. A rule of conduct has to live in the system prompt, always on. Facts are recalled; rules are obeyed.

The instruction file is procedural memory you wrote by hand

You have already met procedural memory, you just did not call it that. The instruction file at the root of a coding project (CLAUDE.md, or AGENTS.md for some tools) is exactly a block of how-to rules that gets prepended to the agent's system prompt on every call. It is not facts about the code; it is conduct: how to behave while working here.

This very repository's CLAUDE.md is a worked example of procedural memory. It says to build all six books before pushing, because someone once pushed a broken SUMMARY.md link and took down the whole deploy. It says no em dashes in prose, because that is a reliable AI tell and reviewers kept flagging it. It says to humanize prose before committing, and to end commit messages with specific trailers. Every one of those rules is a lesson learned from a past failure, written down so the next session does not relearn it the hard way. That file was assembled by hand, one painful mistake at a time.

Procedural learning automates that loop. Instead of a human noticing "the agent keeps forgetting to run tests" and editing the instruction file, the system mines the agent's own session history, finds the recurring mistakes, and proposes the rules itself. The hand-built CLAUDE.md is the target shape; the question is how to grow one from observed behavior.

The pipeline: mine, synthesize, evaluate

The mechanism is small and has three steps.

Mine. Collect past sessions, each labeled with an outcome (success or failure) and, for failures, a tag naming what went wrong: forgot_to_run_tests, committed_without_asking, used_wrong_model_id. Count the tags. A tag that shows up once is probably an accident; a tag that shows up repeatedly is a pattern worth a rule. A frequency threshold separates the two.
Synthesize. Turn each recurring tag into an imperative rule string, and append those rules to the instruction file. The "before" file is the short base; the "after" file is the base plus the newly learned rules.
Evaluate. Take a held-out set of failures the rule-writer never saw, and ask: how many of these would the new rules have caught? A rule "catches" a failure whose tag it addresses. The drop from the before count to the after count is the improvement.

Here is the whole pipeline as runnable code. The session tags are written out explicitly so you can see exactly what becomes a rule and what does not; a production system would tag transcripts automatically (a classifier, or the model itself reading each transcript), but the logic downstream is identical.

"""Procedural learning: mine past failures, rewrite the agent's instructions.

An agent's INSTRUCTION FILE (CLAUDE.md / AGENTS.md) is part of the system prompt:
it is re-injected on every call and shapes how the agent behaves. SEMANTIC memory
stores FACTS the agent knows (chapter 9). PROCEDURAL memory stores HOW-TO RULES that
change what the agent does. This file builds the smallest honest pipeline that turns
observed mistakes into new procedural rules:

  TRACES     past sessions, each with an outcome (success/failure). A failure
             carries a TAG naming the mistake, e.g. "forgot_to_run_tests".
  MINE       count the failure tags; a tag is RECURRING if it appears at least
             THRESHOLD times. One-off failures are noise; repeats are a pattern.
  SYNTHESIZE map each recurring tag to an imperative rule string, and append the
             new rules to a short base instruction file. Show BEFORE and AFTER.
  EVALUATE   on a HELD-OUT set of failure traces the rules never saw, count how
             many the new rules would have CAUGHT (a rule catches a failure whose
             tag it addresses). Print BEFORE vs AFTER failure counts.

Python stdlib only. No network, no model call.
"""

from collections import Counter

# --------------------------------------------------------------------------
# A trace is one past session. We keep only what mining needs: the outcome,
# and (for failures) the tag naming WHAT went wrong. A real system would tag
# transcripts automatically (a classifier, or the model itself reading the
# transcript); here the tags are explicit so you can see the whole pipeline.
# --------------------------------------------------------------------------
def trace(outcome, tag=None):
    return {"outcome": outcome, "tag": tag}


# The training traces: sessions we have already seen and labeled. The failures
# repeat a few distinct mistakes, plus one rare one-off (a flaky network).
TRAIN = [
    trace("success"),
    trace("failure", "forgot_to_run_tests"),
    trace("success"),
    trace("failure", "committed_without_asking"),
    trace("failure", "forgot_to_run_tests"),
    trace("failure", "used_wrong_model_id"),
    trace("success"),
    trace("failure", "forgot_to_run_tests"),
    trace("failure", "skipped_link_check"),
    trace("failure", "committed_without_asking"),
    trace("failure", "used_wrong_model_id"),
    trace("failure", "forgot_to_run_tests"),
    trace("success"),
    trace("failure", "flaky_network_once"),
    trace("failure", "committed_without_asking"),
    trace("failure", "skipped_link_check"),
]

# Held-out failures: sessions the rule-writer never saw. We use these to ask an
# honest question: of failures we did NOT learn from, how many would the new
# rules have prevented? (Evaluating on the training traces would flatter us.)
HELD_OUT = [
    trace("failure", "forgot_to_run_tests"),
    trace("failure", "committed_without_asking"),
    trace("failure", "skipped_link_check"),
    trace("failure", "used_wrong_model_id"),
    trace("failure", "forgot_to_run_tests"),
    trace("failure", "typo_in_filename_once"),  # a one-off, not a learned rule
]

# The synthesis step: each mistake tag maps to one imperative rule. In a real
# system the model would draft the rule text from clusters of failing
# transcripts; the mapping below is the same idea, frozen so the demo is exact.
TAG_TO_RULE = {
    "forgot_to_run_tests":      "Always run the test suite and confirm it is green before reporting a task done.",
    "committed_without_asking": "Never commit or push unless the user explicitly asks; stage the work and wait.",
    "used_wrong_model_id":      "Use the model id 'claude-opus-4-8'; check the model id before sending any LLM call.",
    "skipped_link_check":       "Validate every cross-link and include path builds before claiming a chapter is done.",
}

THRESHOLD = 2  # a tag must recur at least this many times to earn a rule

BASE_INSTRUCTIONS = """\
# Agent instructions (base)
- Explain concepts from first principles; assume no background.
- Prefer small, runnable examples over prose."""


# --------------------------------------------------------------------------
# MINE: count failure tags, keep the ones at or above THRESHOLD.
# --------------------------------------------------------------------------
def mine_failures(traces, threshold):
    """Return (Counter of all failure tags, sorted list of recurring tags)."""
    tags = Counter(t["tag"] for t in traces if t["outcome"] == "failure")
    # Sort recurring tags by frequency (desc), then name, so output is stable.
    recurring = sorted(
        (tag for tag, n in tags.items() if n >= threshold),
        key=lambda tag: (-tags[tag], tag),
    )
    return tags, recurring


# --------------------------------------------------------------------------
# SYNTHESIZE: turn recurring tags into rule lines and append to the base file.
# --------------------------------------------------------------------------
def synthesize_rules(recurring):
    """Map recurring tags to imperative rule strings (skip tags we can't phrase)."""
    return [TAG_TO_RULE[tag] for tag in recurring if tag in TAG_TO_RULE]


def rewrite_instructions(base, rules):
    """Append a learned-rules section to the base instruction file."""
    if not rules:
        return base
    learned = "\n".join(f"- {rule}" for rule in rules)
    return f"{base}\n\n# Learned rules (mined from past failures)\n{learned}"


# --------------------------------------------------------------------------
# EVALUATE: a rule "catches" a held-out failure whose tag it addresses.
# --------------------------------------------------------------------------
def covered_tags(rules):
    """The set of failure tags the learned rules address (reverse the mapping)."""
    rule_to_tag = {rule: tag for tag, rule in TAG_TO_RULE.items()}
    return {rule_to_tag[rule] for rule in rules}


def evaluate(held_out, rules):
    """Count held-out failures the rules would have prevented vs not."""
    caught_tags = covered_tags(rules)
    failures = [t for t in held_out if t["outcome"] == "failure"]
    prevented = [t for t in failures if t["tag"] in caught_tags]
    remaining = [t for t in failures if t["tag"] not in caught_tags]
    return failures, prevented, remaining


# --------------------------------------------------------------------------
# Demo
# --------------------------------------------------------------------------
def main():
    # ---- MINE ----
    print("=== 1. Mine the past failures ===")
    n_fail = sum(1 for t in TRAIN if t["outcome"] == "failure")
    print(f"  {len(TRAIN)} training sessions, {n_fail} of them failures.")
    tags, recurring = mine_failures(TRAIN, THRESHOLD)
    print(f"  failure tags by frequency (threshold to act = {THRESHOLD}):")
    for tag, n in tags.most_common():
        mark = "RECURRING" if n >= THRESHOLD else "one-off  "
        print(f"    {n:>2}x  {mark}  {tag}")
    print(f"  -> recurring tags worth a rule: {recurring}")

    # ---- SYNTHESIZE ----
    rules = synthesize_rules(recurring)
    before_file = BASE_INSTRUCTIONS
    after_file = rewrite_instructions(BASE_INSTRUCTIONS, rules)

    print("\n=== 2. The instruction file BEFORE (base only) ===")
    for line in before_file.splitlines():
        print(f"  | {line}")

    print("\n=== 3. The instruction file AFTER (base + learned rules) ===")
    for line in after_file.splitlines():
        print(f"  | {line}")
    print(f"  ({len(rules)} new rule(s) appended from mined failures.)")

    # ---- EVALUATE ----
    print("\n=== 4. Evaluate on HELD-OUT failures the rules never saw ===")
    failures, prevented, remaining = evaluate(HELD_OUT, rules)
    print(f"  held-out failures: {len(failures)}")
    print("  with the BEFORE instructions (no learned rules):")
    print(f"    failures prevented: 0 / {len(failures)}")
    print("  with the AFTER instructions (learned rules in the system prompt):")
    print(f"    failures prevented: {len(prevented)} / {len(failures)}")
    for t in prevented:
        print(f"      caught   {t['tag']}")
    for t in remaining:
        print(f"      missed   {t['tag']}  (no rule: one-off or below threshold)")

    drop = len(prevented)
    print(f"\n  improvement: {len(failures)} repeat-class failures -> "
          f"{len(failures) - drop} after learning "
          f"({drop} prevented by the new rules).")
    print("  The one-off that remains is correct to leave alone: procedural")
    print("  learning codifies PATTERNS, not every single accident.")


if __name__ == "__main__":
    main()

Running it:

=== 1. Mine the past failures ===
  16 training sessions, 12 of them failures.
  failure tags by frequency (threshold to act = 2):
     4x  RECURRING  forgot_to_run_tests
     3x  RECURRING  committed_without_asking
     2x  RECURRING  used_wrong_model_id
     2x  RECURRING  skipped_link_check
     1x  one-off    flaky_network_once
  -> recurring tags worth a rule: ['forgot_to_run_tests', 'committed_without_asking', 'skipped_link_check', 'used_wrong_model_id']

=== 2. The instruction file BEFORE (base only) ===
  | # Agent instructions (base)
  | - Explain concepts from first principles; assume no background.
  | - Prefer small, runnable examples over prose.

=== 3. The instruction file AFTER (base + learned rules) ===
  | # Agent instructions (base)
  | - Explain concepts from first principles; assume no background.
  | - Prefer small, runnable examples over prose.
  | 
  | # Learned rules (mined from past failures)
  | - Always run the test suite and confirm it is green before reporting a task done.
  | - Never commit or push unless the user explicitly asks; stage the work and wait.
  | - Validate every cross-link and include path builds before claiming a chapter is done.
  | - Use the model id 'claude-opus-4-8'; check the model id before sending any LLM call.
  (4 new rule(s) appended from mined failures.)

=== 4. Evaluate on HELD-OUT failures the rules never saw ===
  held-out failures: 6
  with the BEFORE instructions (no learned rules):
    failures prevented: 0 / 6
  with the AFTER instructions (learned rules in the system prompt):
    failures prevented: 5 / 6
      caught   forgot_to_run_tests
      caught   committed_without_asking
      caught   skipped_link_check
      caught   used_wrong_model_id
      caught   forgot_to_run_tests
      missed   typo_in_filename_once  (no rule: one-off or below threshold)

  improvement: 6 repeat-class failures -> 1 after learning (5 prevented by the new rules).
  The one-off that remains is correct to leave alone: procedural
  learning codifies PATTERNS, not every single accident.

Read the four sections in order, because they show the loop closing. The mine step found five distinct failure tags but only four are recurring: flaky_network_once happened a single time and stays below the threshold, so it earns no rule. The synthesize step turned the four recurring tags into four imperative lines and appended them under a "Learned rules" heading, leaving the base instructions untouched. The evaluate step is the honest part: of six failures the rule-writer never saw, the new rules would have caught five, taking the held-out failure count from six to one.

The threshold is doing real work, and it is the same judgment a careful human applies. Of the held-out failures, the one the rules miss (typo_in_filename_once) is exactly the kind of thing you should not write a rule for. If every single accident became a permanent line in the system prompt, the instruction file would bloat into hundreds of brittle rules, most of them firing on situations that will never recur, each one spending tokens on every call (Chapter 2) and crowding out the rules that matter. Procedural learning codifies patterns, not history. The threshold is where you set how much evidence a behavior change requires.

Why this is a context-engineering technique, not just a logging trick

It would be easy to read this as "keep a log of bugs," but the payload is a context change, and that is what puts it in this book. The learned rules are not stored in a database the agent queries; they are written into the system prompt, the most expensive and most privileged real estate in the context. Every rule you add is paid for on every single call for the life of the project, so the threshold is not just noise control, it is a budget decision: a rule has to prevent enough failures to justify its standing token cost.

That framing also tells you the failure mode. A procedural-learning loop with no threshold, or one that mines too aggressively, produces a system prompt that grows without bound, contradicts itself ("always commit" learned in one session, "never commit" in another), and slowly degrades the agent it was meant to improve. The discipline is the same one from Chapter 1: the system prompt must stay lean. A good procedural-learning system adds a rule only when the evidence clears the bar, phrases it once and crisply, and is willing to retire a rule whose failures have stopped happening.

Where you meet this in the wild

The named version of semantic versus episodic versus procedural memory comes from LangChain's LangMem, which treats these as distinct stores and, for the procedural kind, can update an agent's system prompt from accumulated feedback rather than just remembering facts. The specific "mine past transcripts and rewrite the rules file" loop shows up in tools built around coding agents (for example a learn command that reads prior sessions and proposes edits to AGENTS.md or CLAUDE.md). Underneath the product names the shape is the one above: failures in, recurring patterns out, instruction file rewritten.

The use cases are where this earns its keep. A self-improving coding agent that stops repeating the same review-comment-worthy mistake after it makes it twice. Codifying team conventions automatically: if three engineers keep getting the same nit in review, that nit is a rule the agent could learn and apply before the human ever sees the diff. Reducing repeated review comments in general, by turning the feedback an agent receives into rules it will not need the feedback for next time. In each case the win is the same: a behavior that used to require a human to catch every time is converted, once, into a standing instruction.

Using the real tool: commands and before/after proof

The from-scratch demo above is the whole mechanism. In practice you reach for two real tools: the coding agent that already keeps a procedural-memory file, and a library that updates one from feedback. Here is how each maps onto the pipeline you just ran.

Claude Code: the instruction file is the procedural memory, and you grow it

Claude Code (Anthropic's CLI agent, running model claude-opus-4-8) keeps its procedural memory in a CLAUDE.md at the repo root, and it can write the first draft for you. The /init slash command reads the repository and generates the file:

# Inside the repo, start Claude Code and run the init command:
claude

> /init
# Claude reads the project (build scripts, source layout, conventions it can infer)
# and writes a CLAUDE.md describing how to work here. You then edit it by hand.

What /init produces is exactly procedural memory: not facts the agent looks up, but rules of conduct prepended to the system prompt on every call. This very repository's own CLAUDE.md is the worked example we have been describing. Every line in it is a rule that was learned the hard way and written down so the next session does not relearn it:

- Build all six books before pushing (one broken SUMMARY link takes down the whole deploy).
- No em dashes or en dashes in prose (a reliable AI tell reviewers kept flagging).
- Humanize prose before committing (run it through the humanizer checklist).
- End commit messages with the Co-Authored-By and Claude-Session trailers.

Those are the rules. Because the file is re-injected at the start of every session, the agent stops repeating those specific mistakes. The update step, the thing procedural learning automates, is just appending a newly learned rule to the same file:

# A new failure pattern showed up (the agent kept fabricating command output).
# Codify it once, and every future session inherits the rule:
cat >> CLAUDE.md <<'RULE'
- Never fabricate code output; run the snippet and paste the real text block.
RULE
# For agents that read AGENTS.md instead of CLAUDE.md, append to that file the same way.

That >> append is the synthesize step from the pipeline, done by hand. The mining can be automated too: headroom learn (from the headroom-ai package) reads past session transcripts, finds the failed tool calls, correlates them with what eventually succeeded, and writes the learnings into CLAUDE.md or AGENTS.md for you. The shape is identical to the demo: failures in, recurring patterns out, instruction file rewritten.

LangMem: optimize a procedural instruction from feedback

LangMem (from LangChain) is the library that names semantic, episodic, and procedural memory as distinct stores. For the procedural kind it gives you a prompt optimizer: feed it past runs plus feedback and it proposes an improved system prompt. Install it:

pip install -U langmem

The minimal procedural-memory use is create_prompt_optimizer. You give it a model, a set of trajectories (each one a conversation paired with feedback about how it went), and the current prompt; it returns an updated prompt with the lesson folded in:

from langmem import create_prompt_optimizer

# A trajectory is (conversation, feedback). Feedback can be None, a {"score", "comment"}
# dict, or a corrected response. Here the agent committed without being asked:
trajectories = [
    (
        [
            {"role": "user", "content": "Fix the typo in chapter 3."},
            {"role": "assistant", "content": "Fixed and pushed the commit."},
        ],
        {"score": 0.0, "comment": "It committed and pushed without being asked."},
    ),
]

optimizer = create_prompt_optimizer(
    "anthropic:claude-opus-4-8",   # the model that does the rewriting
    kind="prompt_memory",          # also: "metaprompt", "gradient"
)

before = "You are a careful coding assistant."
after = optimizer.invoke({"trajectories": trajectories, "prompt": before})
# `after` is the updated prompt string, now carrying a rule like
# "do not commit or push unless the user explicitly asks."

This is the same loop as the demo, with the LLM doing the synthesize step: the feedback is the mined failure, and the returned string is the instruction file's "after" version. (We do not paste a captured run here because the anthropic SDK and a live key are not installed on this box; treat the snippet as follow-along and run it where the key is set. The official quickstart at langchain-ai.github.io/langmem shows the same call against a live model.)

The proof that matters: repeat-failure rate, before vs after

A learned rule is only worth its token cost if it actually stops the failure from recurring, so measure it. The metric is the repeat-failure rate: on a fixed set of tasks, how many hit a known failure mode. The recipe is the evaluate step you already ran, applied to the real tool:

Build a held-out eval set: tasks that previously triggered a known failure mode (committed without asking, used the wrong model id, shipped a broken link).
Run them with the before instructions and count how many repeat the failure.
Add the learned rule to CLAUDE.md / AGENTS.md (or apply LangMem's updated prompt).
Run the same set with the after instructions and count again.
The drop from the before count to the after count is the prevented-failure number.

An illustrative result, the kind you would expect on a small set:

(illustrative / expected)
repeat-failure rate on 6 held-out tasks:
  BEFORE rules:  6 / 6 repeated a known failure mode
  AFTER  rules:  1 / 6  (5 prevented; the 1 left is a true one-off, correct to leave alone)

That table is labeled illustrative because the numbers depend on your eval set and model. The honest, on-box version of exactly this measurement is the from-scratch demo earlier in this chapter: it ran the held-out evaluation for real and printed failures prevented: 5 / 6, which is the same repeat-failure-rate proof against verified output rather than expected output. The real tools change who writes the rule (a CLI command, an LLM optimizer) but not how you prove it worked: you measure the repeat-failure rate before and after.

Takeaways

Procedural memory is how-to rules that change behavior, stored in the system prompt and applied on every call. It is distinct from semantic memory (facts the agent knows, recalled on demand) and episodic memory (whole past sessions recalled as examples).
A project's CLAUDE.md / AGENTS.md is procedural memory written by hand: each rule is a lesson from a past failure. Procedural learning automates accumulating those rules from observed mistakes.
The pipeline is mine, synthesize, evaluate: count failure tags, promote the recurring ones to imperative rules, append them to the instruction file, and measure prevented failures on a held-out set.
A frequency threshold is the core control. One-off accidents stay out; only patterns earn a rule, because every rule costs tokens on every call and a bloated system prompt degrades the agent it was meant to fix.
In the demo, four mined rules took a held-out failure count from six to one, leaving exactly the single one-off that no rule should cover.

👉 We now have an agent that knows facts, remembers conversations, and has learned its own rules of conduct. The next chapter steps up a level to context orchestration: deciding, per turn, which of these pieces (which facts, which memories, which rules, which tools) actually get assembled into the context the model sees.

Context Engineering in Depth