Redis: caching & feature lookups

We met Redis as Celery's broker, but that's a fraction of what it does. Redis is an in-memory key-value store — absurdly fast (sub-millisecond), and the Swiss Army knife of production systems. For ML it solves three recurring problems: caching predictions, serving features fast, and rate limiting. This chapter shows each.

Setup: pip install redis and start a Redis server (by default it listens on localhost:6379). This chapter is follow-along, so the examples need that server running.

Why in-memory matters

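Redis keeps data in RAM, so the server handles a read or write in microseconds, and a full round trip from a nearby client typically stays under a millisecond. That is orders of magnitude faster than querying a disk-backed database or recomputing a model prediction. The trade-off: RAM is smaller and more expensive than disk, and anything not persisted is lost on restart. So you use Redis for hot, ephemeral data: things you can afford to lose and want now.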

Use 1 — cache predictions

If the same text gets scored repeatedly (popular queries, repeated documents), don't re-run the model — cache the result. The pattern is cache-aside: check the cache; on a miss, compute and store it.

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def cached_score(text, model, ttl=3600):
    key = f"pred:{text}"
    hit = r.get(key)
    if hit is not None:                       # cache HIT — skip the model entirely
        return json.loads(hit)
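    # cache MISS: compute it (assumes scikit-learn-style output, one row of class probabilities per input)
    score = float(model.predict_proba([text])[0][1])   # P(positive class)
    r.set(key, json.dumps(score), ex=ttl)     # store with a TTL (1 hour by default)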
    return score

The ex=ttl argument gives the key a time-to-live: it auto-expires after ttl seconds (an hour by default), so entries that stop being requested drop out and stale predictions age out on their own. Expected behavior:

first call ("great product")   -> MISS, runs model, stores -> 0.9768   (~5 ms)
second call ("great product")  -> HIT,  returns from Redis  -> 0.9768   (~0.2 ms)

The second call is ~25× faster because it never touched the model. For an expensive model (an LLM call costing money and seconds), the saving is far larger, which often makes caching one of the highest-impact optimizations available.
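To see the hit/miss behavior without a running server, the sketch below swaps Redis for a tiny in-memory stand-in. InMemoryCache and CountingModel are hypothetical helpers invented for this illustration, not part of redis-py; the stand-in mimics only get and set(..., ex=...), and the model is assumed to return scikit-learn-style rows of class probabilities.

```python
import json
import time


class InMemoryCache:
    """Hypothetical stand-in for redis.Redis: supports only get() and set(ex=...)."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # expired: behave as if the key never existed
            return None
        return value

    def set(self, key, value, ex=None):
        expires_at = time.monotonic() + ex if ex is not None else None
        self._data[key] = (value, expires_at)


class CountingModel:
    """Hypothetical model that records how often it is actually invoked."""

    def __init__(self):
        self.calls = 0

    def predict_proba(self, texts):
        self.calls += 1
        # One row per input: [P(negative), P(positive)], fixed for the demo.
        return [[0.0232, 0.9768] for _ in texts]


def cached_score(text, model, cache, ttl=3600):
    key = f"pred:{text}"
    hit = cache.get(key)
    if hit is not None:  # cache hit: skip the model
        return json.loads(hit)
    score = float(model.predict_proba([text])[0][1])  # cache miss: compute
    cache.set(key, json.dumps(score), ex=ttl)
    return score


cache = InMemoryCache()
model = CountingModel()
first = cached_score("great product", model, cache)   # miss: runs the model
second = cached_score("great product", model, cache)  # hit: served from the cache
print(first, second, model.calls)
# -> 0.9768 0.9768 1
```

The call counter is the point of the demo: two requests for the same text reach the model only once.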

Don't be confused: caching vs. the result backend. In Chapter 6 Redis stored task results so a producer could fetch them. Here it caches model outputs to avoid recomputation. Same Redis, different purpose — it's a general key-value store, and these are just two of its uses.

Use 2 — a feature store / fast feature lookups

Online models need features now: when a request arrives for user:123, you must fetch that user's features in single-digit milliseconds. A database might be too slow; Redis is built for it. Redis hashes store an object's fields under one key:

# precompute features offline, write them to Redis
r.hset("user:123", mapping={"avg_order": "84.50", "n_sessions": "12", "tier": "gold"})

# at request time, fetch instantly
features = r.hgetall("user:123")
# -> {b'avg_order': b'84.50', b'n_sessions': b'12', b'tier': b'gold'}

This is the core of an online feature store (the foundations book's MLOps chapter): batch jobs compute features and push them to Redis; the serving model reads them with hgetall, typically in under a millisecond. Tools like Feast use this pattern (often with Redis as the online store), and they add shared feature definitions so that training and serving read the same feature logic, which is what guards against training/serving skew.
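By default redis-py returns hash fields and values as bytes, as the hgetall output above shows, so they need converting before they reach a model. A minimal sketch of that step follows; decode_features and the FEATURE_TYPES schema are hypothetical names chosen for this example, and the field types are assumed from the sample values.

```python
from typing import Callable, Dict

# Hypothetical schema: how each stored field should be parsed.
FEATURE_TYPES: Dict[str, Callable[[str], object]] = {
    "avg_order": float,
    "n_sessions": int,
    "tier": str,
}


def decode_features(raw: Dict[bytes, bytes]) -> Dict[str, object]:
    """Turn the bytes mapping from HGETALL into typed Python values."""
    features = {}
    for field, value in raw.items():
        name = field.decode("utf-8")
        parse = FEATURE_TYPES.get(name, str)  # unknown fields stay as strings
        features[name] = parse(value.decode("utf-8"))
    return features


raw = {b"avg_order": b"84.50", b"n_sessions": b"12", b"tier": b"gold"}
features = decode_features(raw)
print(features)
# -> {'avg_order': 84.5, 'n_sessions': 12, 'tier': 'gold'}
```

Passing decode_responses=True to redis.Redis makes the client return str instead of bytes, which removes the .decode calls but still leaves numeric parsing to you. HGETALL returns an empty mapping for a missing key, so an empty result is the signal to fall back to default features.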

Use 3 — rate limiting

Protect your API from abuse by counting requests per user per time window. Redis's atomic INCR plus a TTL makes this a few lines:

def allow_request(user_id, limit=100, window=60):
    key = f"rate:{user_id}"
    count = r.incr(key)             # atomic increment (creates the key at 1)
    if count == 1:
        r.expire(key, window)       # first hit sets the 60-second window
    return count <= limit           # allow until the limit is hit

requests 1..100  -> allow_request -> True
request 101      -> allow_request -> False   (blocked until the window resets)

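Because INCR is atomic, concurrent requests never lose a count, even with thousands arriving at once. One caveat: INCR and EXPIRE are two separate commands, so if the process dies between them the key is left without a TTL and that user stays blocked once they pass the limit; running both inside a Lua script or a MULTI/EXEC transaction closes that gap. In the normal case the key auto-expires, resetting the window.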
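The window logic can be checked without a server. The sketch below is an in-process simulation of the same fixed-window counter, not Redis code: FixedWindowLimiter is a hypothetical class, and the clock is passed in explicitly so the expiry can be exercised without waiting.

```python
class FixedWindowLimiter:
    """Hypothetical in-process model of INCR + EXPIRE fixed-window limiting."""

    def __init__(self, limit=100, window=60):
        self.limit = limit
        self.window = window
        self._counters = {}  # user_id -> (count, time the window expires)

    def allow_request(self, user_id, now):
        count, expires_at = self._counters.get(user_id, (0, None))
        if expires_at is not None and now >= expires_at:
            count, expires_at = 0, None      # key expired: start a fresh window
        count += 1                           # mirrors INCR
        if count == 1:
            expires_at = now + self.window   # mirrors EXPIRE on the first hit
        self._counters[user_id] = (count, expires_at)
        return count <= self.limit


limiter = FixedWindowLimiter(limit=100, window=60)
results = [limiter.allow_request("u1", now=0.0) for _ in range(101)]
print(results[:100] == [True] * 100)          # -> True: requests 1..100 are allowed
print(results[100])                           # -> False: request 101 is blocked
print(limiter.allow_request("u1", now=60.0))  # -> True: a new window has started
```

Because the window is anchored to the first request, a client can send up to the limit just before a window ends and again right after it resets. If bursts like that matter, a sliding-window scheme is the usual next step.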

The data structures you'll actually use

Structure	Command	ML use
String	`SET`/`GET`	cache a prediction or config
Hash	`HSET`/`HGETALL`	a feature vector / user profile
Counter	`INCR`/`EXPIRE`	rate limiting, metrics
Sorted set	`ZADD`/`ZRANGE`	leaderboards, top-K, recent items
TTL	`EX`/`EXPIRE`	auto-expiring cache entries

Production notes

TTLs everywhere — cache entries should expire, or Redis fills up. Set maxmemory and an eviction policy (allkeys-lru) so it drops the least recently used keys under memory pressure.
Persistence — Redis can persist to disk (RDB/AOF), but treat it as a cache you can rebuild, not your source of truth.
It's a network hop — fast, but not free; batch reads (MGET, pipelines) when you can.
Cache invalidation — "one of the two hard problems." When the model changes, old cached predictions are wrong — version your cache keys (pred:v2:...) so a new model can't serve stale answers.
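One way to build versioned keys is sketched below. make_cache_key and the MODEL_VERSION constant are hypothetical; hashing the text with SHA-256 is an extra assumption made here so that long documents still yield short, fixed-length keys.

```python
import hashlib

MODEL_VERSION = "v2"  # hypothetical: bump whenever a new model is deployed


def make_cache_key(text: str, model_version: str = MODEL_VERSION) -> str:
    """Build a key like 'pred:v2:<sha256 hex digest of the text>'."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"pred:{model_version}:{digest}"


key_v2 = make_cache_key("great product")
key_v3 = make_cache_key("great product", model_version="v3")
print(key_v2 != key_v3)  # -> True: a new model version never reads old entries
```

Entries written under the old version are never read again and simply age out through their TTLs, so no explicit purge is needed.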

The takeaway

Redis is the in-memory workhorse: cache predictions (cache-aside with a TTL) to skip expensive recomputation, serve features in sub-millisecond lookups (the online feature-store pattern), and rate-limit with atomic counters. Always set TTLs, version your cache keys, and treat it as rebuildable. We can now serve fast and scale work — but which data trained the model we're serving? Time to version it. 👉

Production ML & AI Tools: A Hands-On Field Guide