Redis: caching & feature lookups
We met Redis as Celery's broker, but that's a fraction of what it does. Redis is an in-memory key-value store — absurdly fast (sub-millisecond), and the Swiss Army knife of production systems. For ML it solves three recurring problems: caching predictions, serving features fast, and rate limiting. This chapter shows each.
Setup: pip install redis, then start a Redis server. This chapter is a follow-along and needs a running Redis.
Why in-memory matters
Redis keeps data in RAM, so the server handles a read or write in microseconds, and a full round trip over a local network typically stays under a millisecond. That is often orders of magnitude faster than querying a disk-backed database or recomputing a model prediction. The trade-off: RAM is smaller and (by default) volatile. So you use Redis for hot, ephemeral data: things you can afford to lose and want now.
Use 1 — cache predictions
If the same text gets scored repeatedly (popular queries, repeated documents), don't re-run the model — cache the result. The pattern is cache-aside: check the cache; on a miss, compute and store it.
```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)


def cached_score(text, model, ttl=3600):
    key = f"pred:{text}"
    hit = r.get(key)
    if hit is not None:  # cache HIT: skip the model entirely
        return json.loads(hit)
    # cache MISS: compute it (assumes a scikit-learn-style binary classifier,
    # where predict_proba returns one row of [P(negative), P(positive)])
    score = float(model.predict_proba([text])[0][1])
    r.set(key, json.dumps(score), ex=ttl)  # store with a TTL (default: 1 hour)
    return score
```
The ex=ttl argument gives the key a time-to-live: it expires automatically after ttl seconds (an hour by default), so stale predictions clean themselves up and entries nobody asks for again do not pile up forever. Expected behavior:
```text
first call  ("great product") -> MISS, runs model, stores -> 0.9768 (~5 ms)
second call ("great product") -> HIT, returns from Redis  -> 0.9768 (~0.2 ms)
```
In this illustration the second call is ~25× faster because it never touched the model. For an expensive model (an LLM call costing money and seconds), caching is often the highest-impact optimization available.
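The same cache-aside steps work for any expensive function, not just one model. Here is a minimal sketch, assuming the client you pass in exposes redis-py's get and set(..., ex=...) methods; the helper name cache_aside and its parameters are made up for illustration.

```python
import json
from typing import Any, Callable


def cache_aside(client: Any, key: str, compute: Callable[[], Any], ttl: int = 3600) -> Any:
    """Return the cached value for `key`, computing and storing it on a miss.

    `client` is assumed to behave like redis.Redis: get(name) returns bytes
    or None, and set(name, value, ex=seconds) stores a value with a TTL.
    """
    hit = client.get(key)
    if hit is not None:
        # Cache hit: decode the stored JSON and skip the computation.
        return json.loads(hit)
    value = compute()
    # Cache miss: store the JSON-encoded result so the next caller gets a hit.
    client.set(key, json.dumps(value), ex=ttl)
    return value
```

You would call it as cache_aside(r, f"pred:{text}", lambda: score_text(text)), where score_text stands in for whatever wraps your model. The value has to be JSON-serializable, which holds for a float score but not for a raw NumPy array.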
Don't be confused: caching vs. the result backend. In Chapter 6 Redis stored task results so a producer could fetch them. Here it caches model outputs to avoid recomputation. Same Redis, different purpose — it's a general key-value store, and these are just two of its uses.
Use 2 — a feature store / fast feature lookups
Online models need features now: when a request arrives for user:123, you must
fetch that user's features in single-digit milliseconds. A database might be too slow;
Redis is built for it. Redis hashes store an object's fields under one key:
```python
# precompute features offline, write them to Redis
r.hset("user:123", mapping={"avg_order": "84.50", "n_sessions": "12", "tier": "gold"})

# at request time, fetch instantly
features = r.hgetall("user:123")
# -> {b'avg_order': b'84.50', b'n_sessions': b'12', b'tier': b'gold'}
```
This is the core of an online feature store (the foundations book's MLOps chapter): batch jobs compute features and push them to Redis; the serving model reads them with hgetall, typically in under a millisecond. Tools like Feast use this pattern (often with Redis as the online store), and they add something crucial: features are defined once, and both training and serving read from those same definitions, which is what keeps training/serving skew in check.
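Notice that hgetall hands back bytes for every field and value, so the model can't consume the result directly. Below is a minimal sketch of a decoding step; the FEATURE_TYPES schema and the decode_features helper are hypothetical names for illustration, not part of redis-py.

```python
from typing import Any, Callable, Dict

# Hypothetical schema: which Python type each stored field should become.
FEATURE_TYPES: Dict[str, Callable[[str], Any]] = {
    "avg_order": float,
    "n_sessions": int,
    "tier": str,
}


def decode_features(raw: Dict[bytes, bytes],
                    types: Dict[str, Callable[[str], Any]] = FEATURE_TYPES) -> Dict[str, Any]:
    """Turn the bytes-to-bytes mapping from HGETALL into typed features."""
    features: Dict[str, Any] = {}
    for field, value in raw.items():
        name = field.decode("utf-8")
        cast = types.get(name, str)  # fields missing from the schema stay strings
        features[name] = cast(value.decode("utf-8"))
    return features


raw = {b"avg_order": b"84.50", b"n_sessions": b"12", b"tier": b"gold"}
print(decode_features(raw))
# -> {'avg_order': 84.5, 'n_sessions': 12, 'tier': 'gold'}
```

Two related details: hgetall returns an empty dict for a key that doesn't exist, so decide what the model should do for an unknown user; and creating the client with redis.Redis(decode_responses=True) makes redis-py return str instead of bytes, which removes the decode calls but not the need to cast numbers.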
Use 3 — rate limiting
Protect your API from abuse by counting requests per user per time window. Redis's
atomic INCR plus a TTL makes this a few lines:
```python
def allow_request(user_id, limit=100, window=60):
    key = f"rate:{user_id}"
    count = r.incr(key)  # atomic increment (creates the key at 1)
    if count == 1:
        r.expire(key, window)  # first hit sets the 60-second window
    return count <= limit  # allow until the limit is hit
```
```text
requests 1..100 -> allow_request -> True
request 101     -> allow_request -> False (blocked until the window resets)
```
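Because INCR is atomic, the count is correct even with thousands of concurrent requests: no two requests can ever see the same value. One caveat: INCR and EXPIRE are separate commands, so if the process dies between them the key never expires and a user who reaches the limit stays blocked. Running both in a MULTI/EXEC transaction or a Lua script closes that gap. In the normal case the key auto-expires, resetting the window.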
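One way to make the increment and the expiry succeed or fail together is a redis-py pipeline, which wraps its commands in MULTI/EXEC by default. Here is a minimal sketch under some assumptions: the client is passed in rather than read from a global, the function name allow_request_windowed is made up, and the window index is baked into the key so refreshing the TTL on every call is harmless.

```python
import time
from typing import Any, Optional


def allow_request_windowed(client: Any, user_id: str, limit: int = 100,
                           window: int = 60, now: Optional[float] = None) -> bool:
    """Fixed-window limiter whose INCR and EXPIRE run in one transaction.

    `client` is assumed to behave like redis.Redis (it must offer pipeline()).
    `now` exists only so tests can pin the clock; leave it as None in production.
    """
    current = time.time() if now is None else now
    bucket = int(current // window)       # index of the current time window
    key = f"rate:{user_id}:{bucket}"      # a fresh key per window, so counts reset by name
    pipe = client.pipeline()              # transactional (MULTI/EXEC) by default in redis-py
    pipe.incr(key)
    pipe.expire(key, window)              # safe to repeat: the key only lives for this window
    count, _ = pipe.execute()             # results come back in command order
    return count <= limit
```

The design trade-off: windows are aligned to the clock (0:00 to 0:59, 1:00 to 1:59), so a burst straddling a boundary can briefly let through up to twice the limit. That's usually acceptable for abuse protection; a sliding window is stricter but costs more per request.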
The data structures you'll actually use
| Structure / feature | Commands | ML use |
|---|---|---|
| String | SET/GET | cache a prediction or config |
| Hash | HSET/HGETALL | a feature vector / user profile |
| Counter | INCR/EXPIRE | rate limiting, metrics |
| Sorted set | ZADD/ZRANGE | leaderboards, top-K, recent items |
| TTL | EX/EXPIRE | auto-expiring cache entries |
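The sorted-set row is the only one this chapter doesn't demonstrate, so here is a minimal sketch of keeping a top-K list of scored items. It assumes a redis-py client is passed in; the key name top_preds and both helper names are invented for illustration.

```python
from typing import Any, Dict, List, Tuple


def add_scores(client: Any, scores: Dict[str, float], board: str = "top_preds") -> None:
    """Insert or update members of a sorted set, keyed by item id."""
    if scores:  # redis-py rejects an empty mapping, so skip the call
        client.zadd(board, scores)


def top_k(client: Any, k: int = 10, board: str = "top_preds") -> List[Tuple[str, float]]:
    """Return up to k (item, score) pairs, highest score first."""
    # Ranks are inclusive on both ends, so 0..k-1 yields k members.
    rows = client.zrange(board, 0, k - 1, desc=True, withscores=True)
    return [
        (member.decode("utf-8") if isinstance(member, bytes) else member, score)
        for member, score in rows
    ]
```

For example, after add_scores(r, {"doc_a": 0.42, "doc_b": 0.97, "doc_c": 0.81}), calling top_k(r, k=2) returns doc_b and doc_c with their scores. Redis keeps the set ordered on every write, so reading the top K doesn't require sorting at request time.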
Production notes
- TTLs everywhere: cache entries should expire, or Redis fills up. Set a maxmemory limit and an eviction policy (allkeys-lru) so it drops the least recently used keys under memory pressure.
- Persistence: Redis can persist to disk (RDB/AOF), but treat it as a cache you can rebuild, not your source of truth.
- It's a network hop: fast, but not free; batch reads (MGET, pipelines) when you can.
- Cache invalidation: "one of the two hard problems." When the model changes, old cached predictions are wrong, so version your cache keys (pred:v2:...) and a new model can't serve stale answers.
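Two of these notes, versioned keys and batched reads, fit in one small sketch. It assumes a redis-py client is passed in and that predictions were stored as JSON; MODEL_VERSION, pred_key and cached_scores are hypothetical names. Hashing the text is an extra choice made here so that long documents don't turn into very long keys.

```python
import hashlib
import json
from typing import Any, List, Optional

MODEL_VERSION = "v2"  # hypothetical tag; bump it whenever the model changes


def pred_key(text: str, version: str = MODEL_VERSION) -> str:
    """Build a versioned, fixed-length cache key for a piece of text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"pred:{version}:{digest}"


def cached_scores(client: Any, texts: List[str],
                  version: str = MODEL_VERSION) -> List[Optional[float]]:
    """Fetch many cached predictions in one round trip; None marks a miss."""
    if not texts:
        return []  # MGET needs at least one key
    keys = [pred_key(t, version) for t in texts]
    hits = client.mget(keys)  # one network hop; results line up with `keys`
    return [None if hit is None else json.loads(hit) for hit in hits]
```

Bumping MODEL_VERSION makes every old entry unreachable at once; the orphaned keys then age out through their TTLs, so nothing has to be deleted by hand. The caller scores only the texts that came back as None and writes those results under the same keys.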
The takeaway
Redis is the in-memory workhorse: cache predictions (cache-aside with a TTL) to skip expensive recomputation, serve features in sub-millisecond lookups (the online feature-store pattern), and rate-limit with atomic counters. Always set TTLs, version your cache keys, and treat it as rebuildable. We can now serve fast and scale work — but which data trained the model we're serving? Time to version it. 👉