IVF — the inverted file index
IVF (Inverted File index) attacks the time problem: instead of scanning every vector, partition them into regions and scan only the regions near the query. It's the "go straight to the cooking section of the library" idea.
How it works
- Train: run k-means to get
nlistcentroids. These definenlistregions (mathematically, Voronoi cells: each point belongs to its nearest centroid). - Add: assign every database vector to its nearest centroid, and store its id in that centroid's list. These lists are the "inverted lists".
- Search: find the
nprobecentroids nearest the query, then do an exact search only over the vectors in those few lists.
The name "inverted file" comes from text search: instead of "document → words", you keep "region → vectors in it", and look up by region.
centroid 0 ─► [v3, v9, v17, ...]
query ──► centroid 1 ─► [v1, v5, v22, ...] ◄─ nprobe=2: scan
centroid 2 ─► [v8, v11, ...] ◄─ these two lists only
centroid 3 ─► [v2, v14, ...]
The code
import numpy as np
from ivfpq import kmeans, _assign # _assign: nearest-centroid helper
class IVF:
def __init__(self, nlist=64, nprobe=8, seed=0):
self.nlist = nlist # number of regions
self.nprobe = nprobe # regions scanned per query
self.seed = seed
def train(self, X):
self.centroids, _ = kmeans(X, self.nlist, seed=self.seed)
return self
def add(self, X):
self.X = np.asarray(X, dtype=np.float64)
a = _assign(self.X, self.centroids) # which region each vector
self.lists = [np.where(a == c)[0] for c in range(len(self.centroids))]
return self
def search(self, q, k=10, nprobe=None):
nprobe = nprobe or self.nprobe
dc = ((self.centroids - q) ** 2).sum(1) # distance to each region
cells = np.argsort(dc)[:nprobe] # the nprobe nearest regions
cand = np.concatenate([self.lists[c] for c in cells])
d = ((self.X[cand] - q) ** 2).sum(1) # EXACT within those regions
order = np.argsort(d)[:k]
return cand[order], d[order]
Reading it
trainlearns the region centers;addfiles each vector into its region's list (np.where(a == c)[0]collects the ids in regionc).searchranks regions by distance to the query, takes the closestnprobe, gathers their vectors, and does an ordinary exact search over just those. Note the result is exact within the probed regions — IVF's only error is when a true neighbor sits in a region we didn't probe.
The regions really do partition the data
Building an 8-region IVF over 1000 vectors files them into balanced lists:
import numpy as np
from ivfpq import IVF
data = np.random.default_rng(1).normal(size=(1000, 16))
ivf = IVF(nlist=8, seed=0).train(data); ivf.add(data)
print("cell sizes:", [len(l) for l in ivf.lists])
Output:
cell sizes: [126, 144, 99, 109, 139, 139, 119, 125]
Every vector landed in exactly one of the 8 lists, ~125 each. A query now scans a couple of lists (~250 vectors) instead of all 1000.
The nprobe dial
nprobe trades recall for speed, without rebuilding:
nprobe = 1— fastest, but misses neighbors that fall just across a region border.- larger
nprobe— scans more regions, higher recall, more work.
In the demo you'll see IVF hit recall 1.0 while scanning ~6% of
the data (nprobe=8 of nlist=128). That's the whole point: near-exact answers
at a fraction of the cost.
What IVF does not do
IVF keeps the full vectors — it saves time, not memory. A billion vectors still need terabytes. To shrink them, we need quantization, which is next. 👉