Semantic and response caching

Chapter 6 cached the stable front of the prompt so the model did not re-read it, but the model still ran on every call. This chapter goes one step further: when a new question is close enough to one you have already answered, return the stored answer and skip the model entirely. No prefill, no decode, no output tokens. The whole call collapses into a lookup.

That is a much bigger saving than prefix caching, and it comes with a risk prefix caching does not have. Prefix caching reuses input only on an exact byte match, so a reused prefix is always correct by construction. Semantic caching reuses the answer on an approximate match, so it can be wrong: serve a cached answer to a question that only looked similar, and the user gets a confidently wrong reply. The whole craft is in setting the match bar high enough to avoid that.

The idea: answer once, serve many

Support and FAQ traffic is full of repeats. "How do I reset my password?", "How can I reset my password?", and "I forgot my password, how do I reset it?" are three spellings of one question with one answer. A naive system calls the model three times and pays three times. A semantic cache calls the model once, stores the answer, and serves the other two from memory.

To do that, the cache needs to decide whether a new question means the same thing as one it has seen. It cannot compare the raw text, because the words differ. It needs a way to measure meaning. That is what an embedding gives you.
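To see why raw-text comparison falls short, here is a minimal sketch of an exact-match cache. It assumes a simple normalization (lowercase, strip punctuation, collapse spaces) that is not part of the chapter's demo; the names `normalize`, `exact_cache`, and `exact_lookup` are illustrative only.

```python
import string

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace so that trivial
    formatting differences still match. (Assumed normalization, for illustration.)"""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())

exact_cache = {}  # normalized question text -> stored answer

def exact_lookup(question):
    """Return the stored answer only if the normalized text is identical."""
    return exact_cache.get(normalize(question))

exact_cache[normalize("How do I reset my password?")] = "stored reset answer"

for q in ["how do i reset my password",      # same words -> found
          "How can I reset my password?"]:    # one word differs -> not found
    print(f"{q!r}: {exact_lookup(q)}")
```

The second lookup finds nothing even though it asks the same thing: one changed word is enough to defeat a text key, which is the gap a meaning-based comparison has to close.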

Embeddings: turning text into a vector

An embedding is a function from text to a fixed-length list of numbers (a vector), built so that texts with similar meaning produce vectors that point in similar directions. A production embedder is a trained neural network, but the mechanics of comparing by vector do not depend on how good the embedder is, so we can build a crude one from scratch and the caching logic around it is identical to the real thing.

Our toy embedder hashes the words of a question into a fixed-size vector. Each content word adds to a slot, the vector is scaled to unit length, and two questions that share content words end up pointing the same way. The demo below explains it line by line.

Cosine similarity: measuring the angle

Given two unit vectors, how "similar" are they? Use cosine similarity, the cosine of the angle between them:

$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

Here $a \cdot b$ is the dot product (multiply the two vectors element by element and sum the results), and $\lVert a \rVert$ is the length (norm) of $a$. The ratio runs from $-1$ (opposite meaning) through $0$ (unrelated) to $1$ (identical direction). Our toy embedder only ever adds positive counts, so its scores stay between $0$ and $1$. When both vectors are already unit length, the denominator is $1 \times 1 = 1$, so cosine similarity is just the dot product. That is why the embedder normalizes: it reduces each comparison to one dot product, a multiply-and-add per dimension.
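A small worked example may help make the formula concrete. This is a sketch in plain Python with hand-picked three-dimensional vectors (not real embeddings); the helper names `dot`, `norm`, `unit`, and `cosine_full` are illustrative.

```python
import math

def dot(a, b):
    """Dot product: multiply element by element and sum."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Length (L2 norm) of a vector."""
    return math.sqrt(dot(a, a))

def unit(a):
    """Scale a vector to length 1."""
    n = norm(a)
    return [x / n for x in a]

def cosine_full(a, b):
    """The full formula, valid for vectors of any length."""
    return dot(a, b) / (norm(a) * norm(b))

a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]
c = [0.0, 0.0, 5.0]   # shares no dimension with a

print(f"full formula:        {cosine_full(a, b):.2f}")       # 24 / (5 * 5)
print(f"dot of unit vectors: {dot(unit(a), unit(b)):.2f}")   # same value
print(f"orthogonal pair:     {cosine_full(a, c):.2f}")
```

Here $a \cdot b = 24$ and both norms are $5$, so the full formula gives $24/25 = 0.96$; normalizing first and taking the plain dot product gives the same number, and the pair with no shared dimension scores $0$.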

The threshold: hit, miss, and the line between them

A threshold is the cutoff that turns a similarity score into a yes/no decision. Pick a number, say $0.8$. For a new question, embed it, score it against every cached question, and take the best match.

  • If the best score is at or above the threshold, that is a hit: return the stored answer and skip the model.
  • If the best score is below the threshold, that is a miss: call the model, store the new (question, answer) pair, and return the fresh answer.

Raise the threshold and the cache becomes strict: fewer hits, but every hit is a genuine match. Lower it and the cache becomes greedy: more hits, but eventually it starts serving cached answers to questions that only share a few words. That trade-off is the heart of the chapter, and the demo makes it concrete.

Don't be confused. Prefix caching (Chapter 6) and semantic caching are both "caching", but they reuse different things under different match rules. Prefix caching reuses the input tokens of a stable prompt prefix, keyed on an exact byte match, and the model still runs to produce the answer: a reused prefix is always correct because the bytes are identical. Semantic caching reuses the whole answer, keyed on an approximate similarity match, and the model does not run at all. The payoff is larger (you skip generation, the expensive half from Chapter 2) and so is the failure mode: an approximate match can be wrong, handing one question's answer to a different question. Prefix caching can never serve a wrong answer; semantic caching can, and guarding against that is the threshold's job.

The cache, from scratch

The demo builds the whole thing in NumPy and the standard library: the hashing embedder, cosine similarity, the threshold decision, and a stub model that stands in for a real call. It runs a workload of ten questions, several of which are paraphrases of earlier ones, and reports the hit rate plus the latency and dollars saved. Then it lowers the threshold and demonstrates a false hit: a different question wrongly served a cached answer.

"""Semantic (response) caching from scratch: return a STORED ANSWER for a
SIMILAR question, skipping inference entirely.

Chapter 6 caches the prompt PREFIX (the stable input tokens) and still runs the
model. This is different: we cache the whole ANSWER, and on a near-duplicate
question we skip the model call completely. That is a much bigger saving and a
new risk (a FALSE HIT: a different question served the wrong cached answer).

Four pieces, all in numpy + stdlib:
  1. A tiny embedding: hash each word into a fixed-dim vector, sum, L2-normalize.
     No external embedding model. (Real caches use a learned sentence embedder;
     the mechanics of compare-by-vector are identical.)
  2. Cosine similarity between a new question's vector and each cached vector.
  3. A THRESHOLD: best match >= threshold -> HIT (return stored answer, skip LLM);
     else MISS (call the stub LLM, store (question, answer), return it).
  4. The threshold trade-off: lower it and watch a different question get a wrong
     cached answer.

Standard library + numpy only. Run:  python3 semantic_cache.py
"""

import hashlib

import numpy as np


def _hash(s):
    """A deterministic hash of a string to a non-negative int. Python's built-in
    hash() is randomized per process (so output would change run to run); md5
    keeps this demo reproducible. Any fixed hash works; the values are arbitrary."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

# --- Cost / latency model (one place to change the assumptions) ---------------
# A cache lookup is cheap and local; an LLM call is slow and billed per token.
LOOKUP_MS = 10        # embed + cosine search, in milliseconds
LLM_MS = 1500         # one model call, in milliseconds
OUT_TOKENS = 500      # tokens the model writes per answer
PRICE_PER_MTOK = 25.0 # output price, US dollars per million tokens (claude-opus-4-8)
COST_PER_CALL = (OUT_TOKENS / 1e6) * PRICE_PER_MTOK  # dollars per LLM call

DIM = 256  # embedding dimension; bigger = fewer hash collisions

# Words that carry no topic ('how', 'do', 'the', ...). A real embedder learns to
# down-weight these; we just drop them so the vector reflects the CONTENT words.
STOPWORDS = set(
    "how do i can you what is are your please tell about me a an the to it of "
    "my for".split()
)


# --- 1. A tiny embedding: hashing bag-of-words --------------------------------
def embed(text):
    """Turn text into a unit vector so that similar questions land near each other.

    'Embedding' just means: a function from text to a fixed-length vector of
    numbers, built so that texts with similar meaning point in similar directions.
    A real embedder is a trained network; this one is a cheap stand-in with two
    honest tricks so plain word-overlap is not fooled by filler:

      * drop STOPWORDS, so 'how do i' does not dominate the vector;
      * for each remaining word, hash the WORD and also its character trigrams
        (3-letter windows). Trigrams make 'refund' and 'refunds' look alike and
        give content words more weight than a single slot would.

    We hash each feature to an index in [0, DIM) and add to that slot, building a
    bag-of-features count vector, then L2-normalize it (divide by its length) so
    every vector sits on the unit sphere. Normalizing is what lets cosine
    similarity below be a plain dot product.
    """
    v = np.zeros(DIM, dtype=np.float64)
    for raw in text.lower().split():
        word = raw.strip("?.,!;:'\"")
        if not word or word in STOPWORDS:
            continue
        v[_hash("word:" + word) % DIM] += 1.0           # the whole word
        for i in range(len(word) - 2):                  # its character trigrams
            v[_hash("tri:" + word[i:i + 3]) % DIM] += 0.5
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


# --- 2. Cosine similarity -----------------------------------------------------
def cosine(a, b):
    """Cosine similarity: the cosine of the angle between two vectors.

      cos(a, b) = (a . b) / (||a|| ||b||)

    The dot product a . b sums the element-wise products; ||a|| is a's length.
    The result runs from -1 (opposite) through 0 (unrelated) to 1 (identical
    direction). Because embed() already returns unit-length vectors, ||a|| and
    ||b|| are both 1, so this reduces to the dot product a . b.
    """
    return float(np.dot(a, b))


# --- 3. The stub LLM (what we are trying to avoid calling) --------------------
LLM_CALLS = 0  # count real model calls so we can prove the saving

def stub_llm(question):
    """Stand-in for a real model call: slow and expensive in the cost model.
    Returns a canned 'answer' so the demo is deterministic and offline."""
    global LLM_CALLS
    LLM_CALLS += 1
    return f"[answer to: {question!r}]"


# --- The cache ----------------------------------------------------------------
class SemanticCache:
    def __init__(self, threshold):
        self.threshold = threshold
        self.vectors = []   # cached question embeddings
        self.questions = [] # the original question text (for inspection)
        self.answers = []   # the stored answer for each cached question

    def lookup(self, question):
        """Return (answer, matched_question, score) on a hit, or None on a miss.
        A hit means: some cached question's embedding is at least `threshold`
        similar to this one. We return the closest match above the line."""
        if not self.vectors:
            return None
        q = embed(question)
        scores = [cosine(q, v) for v in self.vectors]
        best = int(np.argmax(scores))
        if scores[best] >= self.threshold:
            return self.answers[best], self.questions[best], scores[best]
        return None

    def store(self, question, answer):
        self.vectors.append(embed(question))
        self.questions.append(question)
        self.answers.append(answer)

    def ask(self, question):
        """The full path: try the cache, else call the model and cache the result.
        Returns (answer, was_hit, latency_ms, dollars_spent)."""
        hit = self.lookup(question)
        if hit is not None:
            answer, _, _ = hit
            return answer, True, LOOKUP_MS, 0.0
        answer = stub_llm(question)
        self.store(question, answer)
        # A miss pays the lookup AND the model call (we looked first, found nothing).
        return answer, False, LOOKUP_MS + LLM_MS, COST_PER_CALL


# --- A workload with deliberate near-duplicate paraphrases --------------------
# Three distinct intents, each asked several ways. A good cache should serve the
# paraphrases from the answer it computed the first time.
WORKLOAD = [
    "How do I reset my password?",            # intent A, first time -> MISS
    "What is your refund policy?",            # intent B, first time -> MISS
    "How can I reset my password?",           # A paraphrase -> should HIT
    "I forgot my password, how do I reset it?",  # A paraphrase -> should HIT
    "How do I get a refund?",                 # B paraphrase -> ideally HIT; the toy embedder MISSes
    "Where is your office located?",          # intent C, first time -> MISS
    "How do I change my password?",           # A-ish (change vs reset) -> MISSes at 0.8
    "Tell me about your refund policy please",   # B paraphrase -> should HIT
    "What are your office hours?",            # C-ish, different intent -> MISS
    "reset my password",                      # A paraphrase -> should HIT
]


def run(cache, label):
    print(f"=== {label} (threshold = {cache.threshold}) ===")
    global LLM_CALLS
    LLM_CALLS = 0
    hits = 0
    total_ms = 0.0
    total_dollars = 0.0
    for q in WORKLOAD:
        _, was_hit, ms, dollars = cache.ask(q)
        hits += was_hit
        total_ms += ms
        total_dollars += dollars
        tag = "HIT " if was_hit else "MISS"
        print(f"  {tag}  {q}")
    n = len(WORKLOAD)
    print(f"\n  hit rate:   {hits}/{n} = {hits / n:.0%}")
    print(f"  LLM calls:  {LLM_CALLS} (one per MISS)")
    return total_ms, total_dollars, hits


# --- BEFORE vs AFTER: no cache vs cache ---------------------------------------
n = len(WORKLOAD)
# Baseline: every question hits the model.
no_cache_ms = n * LLM_MS
no_cache_dollars = n * COST_PER_CALL

print("Cost model: cache lookup ~{}ms, LLM call ~{}ms and {} output tokens "
      "at ${}/Mtok = ${:.5f}/call.".format(
          LOOKUP_MS, LLM_MS, OUT_TOKENS, PRICE_PER_MTOK, COST_PER_CALL))
print(f"Workload: {n} questions, with paraphrases of earlier ones mixed in.\n")

print("--- BEFORE (no cache): every question calls the model ---")
print(f"  {n} LLM calls")
print(f"  latency: {no_cache_ms:,.0f} ms")
print(f"  cost:    ${no_cache_dollars:.5f}\n")

cache = SemanticCache(threshold=0.8)
cache_ms, cache_dollars, hits = run(cache, "AFTER (semantic cache)")

print("\n--- SAVED by the cache ---")
print(f"  latency: {no_cache_ms - cache_ms:,.0f} ms "
      f"({1 - cache_ms / no_cache_ms:.0%} faster)")
print(f"  cost:    ${no_cache_dollars - cache_dollars:.5f} "
      f"({1 - cache_dollars / no_cache_dollars:.0%} cheaper)")
print()


# --- 4. The threshold trade-off: a FALSE HIT ----------------------------------
# Lower the bar and the cache gets greedy: it starts serving cached answers to
# questions that only LOOK similar. Here two DIFFERENT intents share a content
# word ('cancel') and a loose threshold collapses them into one answer.
print("=== The threshold trade-off: a FALSE HIT ===")
first = "How do I cancel my subscription?"
second = "How do I cancel my flight?"   # different intent, shares 'cancel'
score = cosine(embed(first), embed(second))
print(f"  cached question:  {first}")
print(f"  new question:     {second}")
print(f"  similarity:       {score:.2f}\n")

# At the safe threshold the cache correctly MISSES and would ask the model.
safe = SemanticCache(threshold=0.8)
safe.store(first, stub_llm(first))
print(f"  threshold 0.8 -> "
      f"{'HIT' if safe.lookup(second) else 'MISS'}  (correct: these are different "
      f"questions, ask the model)")

# Drop the threshold below the score and the same pair becomes a wrong HIT.
greedy = SemanticCache(threshold=0.4)
greedy.store(first, stub_llm(first))
hit = greedy.lookup(second)
answer = hit[0] if hit else None
print(f"  threshold 0.4 -> {'HIT' if hit else 'MISS'}  served: {answer}")
print("  ^ WRONG. A loose threshold served the SUBSCRIPTION answer to a question")
print("    about a FLIGHT. This is the core risk of semantic caching: too low a")
print("    threshold trades correctness for hit rate. You cannot push hit rate to")
print("    100% without eventually serving someone the wrong answer.")

Running it:

Cost model: cache lookup ~10ms, LLM call ~1500ms and 500 output tokens at $25.0/Mtok = $0.01250/call.
Workload: 10 questions, with paraphrases of earlier ones mixed in.

--- BEFORE (no cache): every question calls the model ---
  10 LLM calls
  latency: 15,000 ms
  cost:    $0.12500

=== AFTER (semantic cache) (threshold = 0.8) ===
  MISS  How do I reset my password?
  MISS  What is your refund policy?
  HIT   How can I reset my password?
  HIT   I forgot my password, how do I reset it?
  MISS  How do I get a refund?
  MISS  Where is your office located?
  MISS  How do I change my password?
  HIT   Tell me about your refund policy please
  MISS  What are your office hours?
  HIT   reset my password

  hit rate:   4/10 = 40%
  LLM calls:  6 (one per MISS)

--- SAVED by the cache ---
  latency: 5,900 ms (39% faster)
  cost:    $0.05000 (40% cheaper)

=== The threshold trade-off: a FALSE HIT ===
  cached question:  How do I cancel my subscription?
  new question:     How do I cancel my flight?
  similarity:       0.43

  threshold 0.8 -> MISS  (correct: these are different questions, ask the model)
  threshold 0.4 -> HIT  served: [answer to: 'How do I cancel my subscription?']
  ^ WRONG. A loose threshold served the SUBSCRIPTION answer to a question
    about a FLIGHT. This is the core risk of semantic caching: too low a
    threshold trades correctness for hit rate. You cannot push hit rate to
    100% without eventually serving someone the wrong answer.

Read the AFTER block question by question. The first time each of the three topics appears (password, refund, office) it MISSes and calls the model: there is nothing to match against yet. After that, the paraphrases start hitting. "How can I reset my password?" scores high enough against the stored "How do I reset my password?" to clear the bar, so it returns the stored answer for free. So do "I forgot my password, how do I reset it?", "Tell me about your refund policy please", and the terse "reset my password".

Notice the two paraphrases that miss. "How do I get a refund?" is clearly about refunds, and a human would serve it the refund answer, but it shares only one content word with the stored "What is your refund policy?" ("get" and "refund" versus "refund" and "policy"), so its score lands below $0.8$ and the cache plays it safe by asking the model. "How do I change my password?" misses too, and arguably it should: changing a password is a different operation from resetting one. These misses are the cost of a strict threshold. A higher hit rate is available, but only by lowering the bar, and the false-hit demo shows where that road ends.

What the saving is

The BEFORE/AFTER totals come straight from the cost model at the top of the file: a cache lookup is about 10ms, a model call is about 1500ms and writes about 500 output tokens at $25 per million tokens (the claude-opus-4-8 output rate from Chapter 2). On this ten-question workload, four hits cut the model calls from ten to six, which is 40% fewer dollars and 39% less latency. The two numbers track each other because, in this simple model, every hit saves one full model call; latency comes out a point lower because every request, hit or miss, also pays the 10ms lookup. Whether a real workload saves 5% or 60% depends entirely on its duplicate rate: the fraction of traffic that is a near-repeat of something already answered. FAQ bots and repeated analytics questions can be very high; open-ended creative requests can be near zero.
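The relationship between hit rate and saving can be written down directly from the cost model. The sketch below reuses the chapter's LOOKUP_MS and LLM_MS values; the function name `savings` is illustrative, and it assumes, as the demo does, that a lookup costs no dollars.

```python
LOOKUP_MS = 10     # same assumptions as the chapter's cost model
LLM_MS = 1500

def savings(hit_rate, lookup_ms=LOOKUP_MS, llm_ms=LLM_MS):
    """Fractional (latency, cost) saving versus calling the model every time.

    Every request pays the lookup; only the misses pay the model call."""
    cached_ms = lookup_ms + (1 - hit_rate) * llm_ms   # average latency per request
    latency_saved = 1 - cached_ms / llm_ms            # = hit_rate - lookup_ms / llm_ms
    cost_saved = hit_rate                             # lookups are free in this model
    return latency_saved, cost_saved

for h in (0.0, 0.05, 0.4, 0.6):
    lat, cost = savings(h)
    print(f"hit rate {h:.0%}: latency saved {lat:+.1%}, cost saved {cost:.0%}")
```

At the demo's 40% hit rate this gives a 39.3% latency saving, matching the run above. It also shows the break-even point: under these assumptions the cache only wins on latency once the hit rate exceeds LOOKUP_MS / LLM_MS, which is under 1% here, and at a 0% hit rate it is slightly slower than no cache at all.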

The false hit, and why the threshold is everything

The last block is the warning. Take two genuinely different questions that happen to share a content word: "How do I cancel my subscription?" and "How do I cancel my flight?". Both contain "cancel", so their vectors are not orthogonal; they score $0.43$. At the safe threshold of $0.8$ that is a MISS, which is correct: these are different questions, so the cache asks the model. Drop the threshold to $0.4$ and the same pair becomes a HIT, and now the cache hands the subscription-cancellation answer to someone asking about a flight. That is a false hit, and it is the one failure mode prefix caching cannot produce.
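Where the $0.43$ comes from can be traced by hand. The sketch below recomputes the score with the same feature weights as the demo's embed() (1.0 per content word, 0.5 per character trigram) but keeps the features in a dictionary instead of hashing them into DIM slots. That is an assumption made for readability: it is equivalent to the demo only when no two features collide in the hash, and the names `features` and `cosine` here are local to this sketch.

```python
import math
from collections import Counter

STOPWORDS = set("how do i can you what is are your please tell about me a an "
                "the to it of my for".split())

def features(text):
    """Same weights as the demo's embed(), but un-hashed: feature name -> weight."""
    counts = Counter()
    for raw in text.lower().split():
        word = raw.strip("?.,!;:'\"")
        if not word or word in STOPWORDS:
            continue
        counts["word:" + word] += 1.0                 # the whole word
        for i in range(len(word) - 2):                # its character trigrams
            counts["tri:" + word[i:i + 3]] += 0.5
    return counts

def cosine(f, g):
    """Cosine similarity between two sparse feature dictionaries."""
    dot = sum(w * g[k] for k, w in f.items() if k in g)
    norm_f = math.sqrt(sum(w * w for w in f.values()))
    norm_g = math.sqrt(sum(w * w for w in g.values()))
    return dot / (norm_f * norm_g)

first = features("How do I cancel my subscription?")
second = features("How do I cancel my flight?")
print("shared features:", sorted(set(first) & set(second)))
print(f"similarity: {cosine(first, second):.2f}")
```

The only shared features are the word "cancel" and its four trigrams, which contribute a dot product of $2$. The squared lengths are $5.5$ and $4$, so the score is $2 / \sqrt{22} \approx 0.43$: one shared content word out of two is enough to land well above zero, and squarely inside the range a loose threshold accepts.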

This is the trade-off in one picture. The threshold is a dial between two kinds of mistake. Set it too high and you get false misses: real paraphrases that should have hit but ask the model anyway, costing money you did not need to spend (the refund and change-password misses above). Set it too low and you get false hits: different questions served the wrong answer, costing trust. You cannot push the hit rate to 100% without eventually serving someone the wrong answer, so the right setting is conservative: better to pay for an extra model call than to confidently answer the wrong question. A common starting point is around $0.8$ to $0.85$ on a good embedder, tuned from there against real traffic.
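Tuning against real traffic can be sketched as a sweep over candidate thresholds on a hand-labeled sample. Everything in this example is hypothetical: the scores in LABELED are made up for illustration, and `errors_at` and `safest_threshold` are names invented here, not part of any library.

```python
# Hypothetical labeled pairs: (similarity score, same intent according to a human).
LABELED = [
    (0.97, True), (0.91, True), (0.84, True),
    (0.78, True), (0.62, True), (0.55, True),
    (0.74, False), (0.58, False), (0.43, False),
    (0.30, False), (0.12, False),
]

def errors_at(threshold, labeled=LABELED):
    """Count (false_hits, false_misses) if the cache used this threshold."""
    false_hits = sum(1 for s, same in labeled if s >= threshold and not same)
    false_misses = sum(1 for s, same in labeled if s < threshold and same)
    return false_hits, false_misses

def safest_threshold(candidates, labeled=LABELED):
    """Lowest candidate with zero false hits on the sample, or None if none qualifies."""
    for t in sorted(candidates):
        if errors_at(t, labeled)[0] == 0:
            return t
    return None

CANDIDATES = [t / 100 for t in range(40, 95, 10)]   # 0.4, 0.5, ..., 0.9
for t in CANDIDATES:
    fh, fm = errors_at(t)
    print(f"threshold {t:.1f}: {fh} false hits, {fm} false misses")
print("chosen:", safest_threshold(CANDIDATES))
```

Picking the lowest threshold with zero false hits encodes the conservative preference described above: false misses are tolerated, false hits are not. A sample this small proves little, so in practice the labeled set would need to be much larger and refreshed as traffic changes.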

A real embedder changes the numbers but not the shape of this trade-off. A trained sentence embedder would score "How do I get a refund?" against "What is your refund policy?" much higher than our toy hashing embedder does, so it would catch that paraphrase at a safe threshold and lift the hit rate. It would also keep "cancel my subscription" and "cancel my flight" far apart. A better embedder buys you both a higher hit rate and fewer false hits at the same time, which is exactly why production systems use one. The threshold dial is still there; the embedder just makes its safe range wider.

How this looks in production

You do not build the embedder, the vector store, and the similarity search by hand for a real system. Two representative tools:

GPTCache is an open-source semantic cache. Its architecture is the same pipeline as the demo, split into named stages: an adapter that wraps the LLM call, a pre-processor that normalizes the incoming request, an embedding generator that turns the query into a vector, a cache manager that stores vectors and answers, a similarity evaluator that scores a new query against the stored ones (cosine similarity with a threshold around $0.8$, the same decision the demo makes), and a post-processor that returns the chosen answer. It plugs into orchestration frameworks like LangChain and LlamaIndex, so an existing app can route calls through the cache without rewriting them.

Redis LangCache is a managed version of the same idea: embedding, storage, and similarity search behind a single API call, so you do not run the vector store yourself. The managed search adds a small amount of latency (on the order of a handful of milliseconds) to find a match, which is the cache-lookup cost in our model, and on a hit it saves the full model call, which is the slow, expensive part. The economics are the BEFORE/AFTER of the demo: trade a cheap lookup for an expensive generation, on the fraction of traffic that repeats.

Where these pay off is exactly where the duplicate rate is high: FAQ and support bots, repeated analytics questions ("what were sales last quarter?" asked fifty ways), and any service with heavy near-duplicate traffic. Where they do not pay off is open-ended or per-user-unique work, where almost nothing repeats and nearly every lookup is a miss that still costs you the embedding step. The decision to add a semantic cache is, at bottom, a bet on how repetitive your traffic is.

Using the real tool: commands and before/after proof

The from-scratch demo above is the engine. In an app you wire in a packaged version of it instead of hand-rolling the embedder and vector search. Here is what that looks like with the two tools from the last section, and how you prove the win.

GPTCache: wrap the LLM call

GPTCache is an open-source semantic cache that sits in front of your model call. Install it:

pip install gptcache

The library is not installed on this box and the snippets below need a model API key, so treat them as follow-along: the calls are the real, documented API, but the output shown is illustrative, not run here. This is the semantic-cache setup straight from the GPTCache README. It builds the same pipeline as our demo (an embedder, a vector store, a distance-based similarity check) and points GPTCache's OpenAI adapter at it:

from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

onnx = Onnx()                                    # a real (small) sentence embedder
data_manager = get_data_manager(                 # where vectors and answers live
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension),
)
cache.init(
    embedding_func=onnx.to_embeddings,           # text -> vector (our toy embedder's job)
    data_manager=data_manager,                   # store + search (our cache dict's job)
    similarity_evaluation=SearchDistanceEvaluation(),  # the threshold decision
)
cache.set_openai_key()

SearchDistanceEvaluation is the scoring half of the demo's threshold decision, working from the vector store's search distance (low distance means similar) rather than a similarity (high means similar); the cutoff that score is compared against is set separately in GPTCache's configuration. After cache.init, you call the model through openai from gptcache.adapter instead of the normal client. The adapter checks the cache first and only calls the model on a miss. You do not change your prompt; you change which client you call through. GPTCache also plugs into LangChain: you set it as the global LLM cache with LangChain's set_llm_cache(...), and every model call in the app routes through the cache without rewriting the call sites.

Redis LangCache / redisvl: a before/after you can time

Redis LangCache is the managed service; redisvl (the Redis Vector Library) is the open client that exposes the same SemanticCache you would run yourself against a Redis instance. It gives you a clean check/store pair, which makes the before/after easy to time. Install it:

pip install redisvl

Again follow-along (no Redis, no model key, no redisvl on this box; output is illustrative). The recipe: time two similar questions. The first is a cold MISS that pays for a full model call; the second is a HIT served from the cache with no model call.

import time
from redisvl.extensions.cache.llm import SemanticCache

cache = SemanticCache(
    name="llmcache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,   # COSINE distance in [0, 2]; LOWER is stricter (0 = identical)
)

def answer(question):
    hit = cache.check(prompt=question)        # vector search against stored questions
    if hit:
        return hit[0]["response"], "HIT"      # served from cache, model NOT called
    reply = call_the_model(question)          # your real claude-opus-4-8 call goes here
    cache.store(prompt=question, response=reply)
    return reply, "MISS"

for q in ["How do I reset my password?",      # cold: nothing stored yet -> MISS
          "How can I reset my password?"]:     # paraphrase of the first -> HIT
    t0 = time.perf_counter()
    reply, status = answer(q)
    dt = (time.perf_counter() - t0) * 1000
    print(f"{status:4}  {dt:8.1f} ms  {q}")

Illustrative output (expected shape, not measured here):

MISS    1503.7 ms  How do I reset my password?
HIT       14.2 ms  How can I reset my password?

The MISS pays the full model call (about 1500 ms and a few hundred output tokens). The HIT is just the embed-and-search lookup (about 15 ms) and calls no model, so it costs no output tokens. That is roughly a 100x latency drop on the repeat and the entire avoided generation cost, which on the claude-opus-4-8 rates from Chapter 2 (output billed at 5x input) is where the dollars go.

The knob is distance_threshold. Note the inversion from our demo: redisvl uses cosine distance (lower is stricter, $0$ is identical), while the from-scratch demo used cosine similarity (higher is stricter, $1$ is identical). They are two faces of the same dial. Set it too loose (a high distance threshold here) and you get the false hit the demo proved on "cancel my subscription" versus "cancel my flight": a different question served the wrong stored answer. The on-box semantic_cache.py demo is the real, verified proof of that failure mode and of the BEFORE/AFTER arithmetic; the snippets here are the same logic packaged behind a tool.
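The two conventions can be translated into each other. This sketch assumes cosine distance is defined as one minus cosine similarity, which is consistent with the $[0, 2]$ range noted in the snippet's comment; the helper names are illustrative, and whether a score exactly on the boundary counts as a hit is a per-tool detail not covered here.

```python
def similarity_to_distance(similarity):
    """Cosine distance = 1 - cosine similarity (similarity 1 -> distance 0)."""
    return 1.0 - similarity

def distance_to_similarity(distance):
    """Inverse of the conversion above."""
    return 1.0 - distance

def is_hit_by_similarity(similarity, threshold):
    return similarity >= threshold    # demo convention: HIGHER threshold is stricter

def is_hit_by_distance(distance, threshold):
    return distance <= threshold      # distance convention: LOWER threshold is stricter

# The demo's bar of 0.8 similarity expressed as a distance threshold (about 0.2):
print(f"{similarity_to_distance(0.8):.2f}")
# A distance threshold of 0.1 expressed as a similarity bar (about 0.9):
print(f"{distance_to_similarity(0.1):.2f}")

# The cancel pair scored 0.43: a loose bar in either convention lets it through.
pair_similarity = 0.43
pair_distance = similarity_to_distance(pair_similarity)
print(is_hit_by_similarity(pair_similarity, 0.4), is_hit_by_distance(pair_distance, 0.6))
print(is_hit_by_similarity(pair_similarity, 0.8), is_hit_by_distance(pair_distance, 0.2))
```

Under that assumption, the distance_threshold of 0.1 in the snippet above is stricter than the demo's 0.8: it corresponds to requiring a similarity of about 0.9.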

In Claude Code / at the app layer

Semantic caching lives in your application, in front of the model call, and it does not care which provider you use: the embedder, the vector store, and the threshold are all yours. That makes it independent of, and complementary to, Anthropic's prompt caching from Chapter 6. Prompt caching is a server-side feature that reuses the input prefix tokens when the model does run; semantic caching is a client-side layer that skips the model entirely on a near-duplicate. Use both: prompt caching makes the calls you do make cheaper, and the semantic cache removes the calls you should not be making at all.

Takeaways

  • Semantic caching returns a stored answer for an approximately similar question and skips the model entirely, saving the expensive output half of the call, not just the input prefix that Chapter 6 reuses.
  • The decision is three pieces: an embedding (text to vector), cosine similarity (the angle between two vectors, a plain dot product on unit vectors), and a threshold (the cutoff that turns a score into hit or miss).
  • The threshold is a dial between false misses (paraphrases that needlessly call the model) and false hits (different questions served the wrong answer). False hits are the failure mode prefix caching cannot have, so tune conservatively, around $0.8$ on a good embedder.
  • The dollar and latency saving scales with your duplicate rate: high for FAQ/support bots and repeated analytics, near zero for open-ended or per-user-unique work.
  • A better embedder widens the safe threshold range, raising the hit rate and lowering false hits at once. Tools like GPTCache and Redis LangCache package the embedding, storage, and similarity search so you wire in the cache instead of building it.

👉 We have skipped the model on a cache hit. The next chapter goes back inside the model for the calls you cannot skip: how the KV-cache is served at scale, so the generation you do pay for is as fast and cheap as the hardware allows.