Project 3 — Build an LLM agent

An agent is an LLM that can act: instead of only emitting text, it calls tools (a calculator, a search engine, an API, a database), observes the results, and decides what to do next, looping until it has an answer. This is the pattern behind "agentic AI," Claude Code, and other assistants that take actions rather than only reply. We build the loop from scratch so the mechanics are clear, then show the real Claude version.

Full code: code/projects/agent.py (pure Python — no dependencies).

Why agents exist

An LLM alone can't do arithmetic reliably, look up today's price, or query your database: current prices and your private data aren't in its weights, and on its own it can't execute code. Give it tools and a loop, and it can reason about what it needs, call the right tool, read the result, and continue. That turns a text predictor into a problem-solver.

The ReAct loop: Thought → Action → Observation

The dominant pattern is ReAct (Reason + Act). The model alternates between reasoning about what to do and acting (calling a tool), reading the observation each tool returns, until it has enough to answer:

Question → Thought → Action(tool) → Observation → Thought → … → Answer
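In a text-based ReAct setup, the harness has to pull the Action out of the model's reply before it can run anything. A minimal sketch, assuming the model writes lines in the same Action: tool('arg') and Answer: ... shape used in this chapter's transcript; parse_step is a hypothetical helper, not part of agent.py:

```python
import re

# Matches lines such as: Action: calculator('23*19+7')
ACTION_RE = re.compile(r"^Action:\s*(\w+)\('(.*)'\)\s*$")

def parse_step(reply):
    """Return (action, arg) from a model reply, or ("answer", text)."""
    for line in reply.splitlines():
        line = line.strip()
        m = ACTION_RE.match(line)
        if m:
            return m.group(1), m.group(2)
        if line.startswith("Answer:"):
            return "answer", line[len("Answer:"):].strip()
    # No recognizable step: treat the whole reply as the answer.
    return "answer", reply.strip()

reply = "Thought: I should use calculator('23*19+7')\nAction: calculator('23*19+7')"
print(parse_step(reply))   # ('calculator', '23*19+7')
```

Parsing free text like this is brittle, which is one reason API-native tool calling with structured inputs is usually preferred when it is available.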

Tools are just functions

def calculator(expr):
    return str(eval(expr, {"__builtins__": {}}))   # restricted eval: fine for a demo, NOT a real sandbox

def knowledge_lookup(query):
    return FACTS.get(...)                            # a stand-in for search / a DB

TOOLS = {"calculator": calculator, "knowledge_lookup": knowledge_lookup}
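The eval call above runs with builtins stripped, which blocks casual misuse but is not a sandbox: a crafted expression can still reach Python internals through object attributes. One alternative, sketched here as a swapped-in technique rather than what agent.py does, is to parse the expression with the standard ast module and allow only numeric literals and basic arithmetic. The name safe_calculator is hypothetical:

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
}
UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def _eval(node):
    # Plain int/float literals only (bool is excluded by the exact type check).
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
        return BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in UNARY_OPS:
        return UNARY_OPS[type(node.op)](_eval(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")

def safe_calculator(expr):
    tree = ast.parse(expr, mode="eval")   # raises SyntaxError on malformed input
    return str(_eval(tree.body))

print(safe_calculator("23 * 19 + 7"))     # 444
```

Exponentiation is left out on purpose, since an expression like 9**9**9 can consume unbounded time and memory. Names, attribute access, and function calls all fall through to the ValueError.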

The loop

def run(question, max_steps=5):
    scratchpad = []
    for _ in range(max_steps):
        action, arg = decide(question, scratchpad)   # the LLM picks the next move
        if action == "answer":
            return arg
        observation = TOOLS[action](arg)             # run the chosen tool
        scratchpad += [f"Action: {action}('{arg}')", f"Observation: {observation}"]
    return None                                      # step cap reached before an answer

decide() is the brain — given the question and everything observed so far, it returns the next action. In a real agent that's an LLM call; here it's a deterministic stub so the loop runs and is reproducible.
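To make the stub concrete, here is one way a rule-based decide() could look. This is an illustrative guess, not the code in agent.py: the routing rules, the FACTS contents, and the fallback message are all assumptions chosen so the two example questions from this chapter work end to end.

```python
import re

# Stand-in data and tools, mirroring the shapes used above (contents assumed).
FACTS = {"speed of light": "299,792,458 m/s"}

def calculator(expr):
    return str(eval(expr, {"__builtins__": {}}))   # demo only, not a sandbox

def knowledge_lookup(query):
    return FACTS.get(query.lower(), "unknown")

TOOLS = {"calculator": calculator, "knowledge_lookup": knowledge_lookup}

def decide(question, scratchpad):
    """Rule-based stand-in for the LLM: returns (action, arg)."""
    if scratchpad:
        # An observation is already in hand, so answer with it.
        return "answer", scratchpad[-1][len("Observation: "):]
    # Looks like arithmetic if it has digits joined by an operator.
    m = re.search(r"\d[\d\s+\-*/().]*", question)
    if m and re.search(r"[+\-*/]", m.group()):
        return "calculator", m.group().replace(" ", "")
    # Otherwise treat the rest of the question as a lookup topic.
    topic = question.rstrip("?").split("What is the ")[-1]
    return "knowledge_lookup", topic

def run(question, max_steps=5):
    scratchpad = []
    for _ in range(max_steps):
        action, arg = decide(question, scratchpad)
        if action == "answer":
            return arg
        observation = TOOLS[action](arg)
        scratchpad += [f"Action: {action}('{arg}')", f"Observation: {observation}"]
    return "No answer within the step limit."

print(run("What is 23 * 19 + 7?"))            # 444
print(run("What is the speed of light?"))     # 299,792,458 m/s
```

Swapping this stub for an LLM call changes only decide(); run() and TOOLS stay the same, which is the point of keeping the decision behind one function.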

Running it

$ python agent.py

Output:

Question: What is 23 * 19 + 7?
Thought: I should use calculator('23*19+7')
Action: calculator('23*19+7')
Observation: 444
Answer: 444
----------------------------------------
Question: What is the speed of light?
Thought: I should use knowledge_lookup('speed of light')
Action: knowledge_lookup('speed of light')
Observation: 299,792,458 m/s
Answer: 299,792,458 m/s

The agent routed each question to the right tool — arithmetic to the calculator (getting 444 exactly, which an LLM might fumble), and a fact to the lookup tool — then answered from the observation. That routing-and-looping is agency.

The real thing: Claude with tool use

In production, decide() is a Claude API call, and the model itself chooses the tool. You declare your tools as JSON schemas; Claude responds with a tool_use request; you run the tool and feed the tool_result back; repeat until it stops. The loop is identical to ours:

import anthropic
client = anthropic.Anthropic()

tools = [{
    "name": "calculator",
    "description": "Evaluate an arithmetic expression",
    "input_schema": {"type": "object",
                     "properties": {"expr": {"type": "string"}},
                     "required": ["expr"]},
}]
messages = [{"role": "user", "content": "What is 23 * 19 + 7?"}]

while True:
    resp = client.messages.create(model="claude-opus-4-8", max_tokens=1024,
                                  tools=tools, messages=messages)
    if resp.stop_reason != "tool_use":
        break                                        # Claude has its final answer
    messages.append({"role": "assistant", "content": resp.content})
    results = []
    for block in resp.content:
        if block.type == "tool_use":                 # Claude asked to call a tool
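            out = TOOLS[block.name](**block.input)    # you run it; TOOLS is the dict defined earlier
            results.append({"type": "tool_result", "tool_use_id": block.id,
                            "content": str(out)})
    messages.append({"role": "user", "content": results})   # feed results back

# The final answer is in the text blocks of the last response.
print("".join(b.text for b in resp.content if b.type == "text"))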

stop_reason == "tool_use" is the model saying "run this tool and tell me the result" — the API-native version of our Action/Observation step. (The SDK's tool runner automates this loop entirely; the manual version above shows what it does.)
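In the manual loop, an exception inside a tool (or a tool name you never registered) would crash the harness. A hedged sketch of one way to handle that: catch the failure and report it back in the tool_result block with is_error set, so the model can see what went wrong and try something else. The helper name run_tool_call and the id "toolu_demo" are made up for illustration; the block needs no API access to run.

```python
def run_tool_call(tools, name, tool_input, tool_use_id):
    """Run one requested tool and build the tool_result block to send back."""
    try:
        if name not in tools:
            raise ValueError(f"unknown tool: {name}")
        content, is_error = str(tools[name](**tool_input)), False
    except Exception as exc:
        # Report the failure to the model instead of crashing the loop.
        content, is_error = f"{type(exc).__name__}: {exc}", True
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "content": content, "is_error": is_error}

def calculator(expr):
    return str(eval(expr, {"__builtins__": {}}))   # demo only, not a sandbox

tools = {"calculator": calculator}
ok = run_tool_call(tools, "calculator", {"expr": "23*19+7"}, "toolu_demo")
bad = run_tool_call(tools, "calculator", {"expr": "1/0"}, "toolu_demo")
print(ok["content"], ok["is_error"])     # 444 False
print(bad["is_error"])                   # True
```

Inside the loop, the line that calls the tool would become a call to this helper with block.name, block.input, and block.id, and its return value would be appended to results.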

What makes agents hard

The loop is simple; making it reliable is the frontier:

Error compounding — a wrong step early derails everything after it. Long loops are fragile.
Cost & latency — every step is an LLM call; agents are slow and expensive.
Tool design — clear tool names, descriptions, and schemas dramatically change how well the model uses them.
Termination & guardrails — cap the steps (our max_steps), validate tool inputs, gate destructive actions behind confirmation.
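The guardrails in the last item can be sketched as a thin wrapper around tool dispatch. Everything here is an assumption for illustration: the tool names, the DESTRUCTIVE set, the length limit, and the confirm callback (which in practice might prompt a human operator).

```python
DESTRUCTIVE = {"delete_record"}   # hypothetical tool names that need sign-off

def guarded_call(tools, name, arg, confirm, max_arg_len=200):
    """Validate a requested action, then run it; returns the observation string."""
    if name not in tools:
        return f"Error: unknown tool '{name}'"
    if not isinstance(arg, str) or len(arg) > max_arg_len:
        return "Error: argument must be a string of reasonable length"
    if name in DESTRUCTIVE and not confirm(name, arg):
        return "Error: action declined by the operator"
    return str(tools[name](arg))

# Toy tools and a confirm callback that refuses everything.
tools = {"echo": lambda s: s, "delete_record": lambda rid: f"deleted {rid}"}
deny_all = lambda name, arg: False

print(guarded_call(tools, "echo", "hi", deny_all))            # hi
print(guarded_call(tools, "delete_record", "42", deny_all))   # Error: action declined by the operator
```

Returning errors as observations, rather than raising, keeps the loop alive and gives the model a chance to correct itself on the next step.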

Frameworks (LangChain, LlamaIndex, the Claude Agent SDK) package the loop, tool plumbing, and memory — but it's the loop you just built.

Don't be confused: an agent is a loop, not a model. The "agent" isn't a special kind of LLM — it's an ordinary LLM wrapped in your loop that lets it call tools and see results. The intelligence is the model; the agency is the harness around it.

Make it production

Trace every step (cost, latency, which tools fired). Agents are multi-step, so tracing is essential; see the tools book's LLM-observability chapter.
Gate side effects — tools that send email, spend money, or delete data go behind confirmation.
Cap and budget — max steps and a token budget, or an agent can loop forever and run up a bill.
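Tracing and budgeting can share one small object. The sketch below is an assumption-laden illustration, not a recommended design: the Budget class, traced_call helper, and limits are hypothetical, and the token counts are passed in by hand. With the Claude API they would likely come from the usage field on each response (input_tokens plus output_tokens).

```python
import time

class Budget:
    """Tracks steps and tokens; the limits are illustrative, not recommendations."""
    def __init__(self, max_steps=5, max_tokens=10_000):
        self.max_steps, self.max_tokens = max_steps, max_tokens
        self.steps = self.tokens = 0
        self.trace = []                      # one record per step

    def record(self, tool, tokens, seconds):
        self.steps += 1
        self.tokens += tokens
        self.trace.append({"step": self.steps, "tool": tool,
                           "tokens": tokens, "seconds": round(seconds, 4)})

    def exhausted(self):
        return self.steps >= self.max_steps or self.tokens >= self.max_tokens

def traced_call(budget, tool_name, fn, arg, tokens_used):
    """Run a tool, timing it and charging the step to the budget."""
    start = time.perf_counter()
    result = fn(arg)
    budget.record(tool_name, tokens_used, time.perf_counter() - start)
    return result

budget = Budget(max_steps=3, max_tokens=500)
while not budget.exhausted():
    traced_call(budget, "echo", lambda s: s, "ping", tokens_used=200)
print(budget.steps, budget.tokens)   # 3 600
```

The loop stops on whichever limit is hit first. Here both trip on the third call: the step cap is reached and the token total (600) passes the 500 limit.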

The takeaway

An agent is an LLM in a Thought→Action→Observation loop with tools. You built the loop, and the real Claude version is the same loop with tool_use/tool_result messages. It turns a predictor into something that acts. The hard parts are reliability, cost, and tool design, not the loop itself. Next, we leave language for vision: train a CNN image classifier from scratch.

AI Foundations in Depth