IVF — the inverted file index

IVF (Inverted File index) attacks the time problem: instead of scanning every vector, partition them into regions and scan only the regions near the query. It's the "go straight to the cooking section of the library" idea.

How it works

  1. Train: run k-means to get nlist centroids. These define nlist regions (mathematically, Voronoi cells: each point belongs to its nearest centroid).
  2. Add: assign every database vector to its nearest centroid, and store its id in that centroid's list. These lists are the "inverted lists".
  3. Search: find the nprobe centroids nearest the query, then do an exact search only over the vectors in those few lists.

The name "inverted file" comes from text search: instead of "document → words", you keep "region → vectors in it", and look up by region.

                centroid 0 ─► [v3, v9, v17, ...]
   query  ──►   centroid 1 ─► [v1, v5, v22, ...]   ◄─ nprobe=2: scan
                centroid 2 ─► [v8, v11, ...]        ◄─ these two lists only
                centroid 3 ─► [v2, v14, ...]

The code

import numpy as np
from ivfpq import kmeans, _assign       # _assign: nearest-centroid helper


class IVF:
    def __init__(self, nlist=64, nprobe=8, seed=0):
        self.nlist = nlist      # number of regions
        self.nprobe = nprobe    # regions scanned per query
        self.seed = seed

    def train(self, X):
        self.centroids, _ = kmeans(X, self.nlist, seed=self.seed)
        return self

    def add(self, X):
        self.X = np.asarray(X, dtype=np.float64)
        a = _assign(self.X, self.centroids)               # which region each vector
        self.lists = [np.where(a == c)[0] for c in range(len(self.centroids))]
        return self

    def search(self, q, k=10, nprobe=None):
        nprobe = nprobe or self.nprobe
        dc = ((self.centroids - q) ** 2).sum(1)           # distance to each region
        cells = np.argsort(dc)[:nprobe]                   # the nprobe nearest regions
        cand = np.concatenate([self.lists[c] for c in cells])
        d = ((self.X[cand] - q) ** 2).sum(1)              # EXACT within those regions
        order = np.argsort(d)[:k]
        return cand[order], d[order]

Reading it

  • train learns the region centers; add files each vector into its region's list (np.where(a == c)[0] collects the ids in region c).
  • search ranks regions by distance to the query, takes the closest nprobe, gathers their vectors, and does an ordinary exact search over just those. Note the result is exact within the probed regions — IVF's only error is when a true neighbor sits in a region we didn't probe.

The regions really do partition the data

Building an 8-region IVF over 1000 vectors files them into balanced lists:

import numpy as np
from ivfpq import IVF

data = np.random.default_rng(1).normal(size=(1000, 16))
ivf = IVF(nlist=8, seed=0).train(data); ivf.add(data)
print("cell sizes:", [len(l) for l in ivf.lists])

Output:

cell sizes: [126, 144, 99, 109, 139, 139, 119, 125]

Every vector landed in exactly one of the 8 lists, ~125 each. A query now scans a couple of lists (~250 vectors) instead of all 1000.

The nprobe dial

nprobe trades recall for speed, without rebuilding:

  • nprobe = 1 — fastest, but misses neighbors that fall just across a region border.
  • larger nprobe — scans more regions, higher recall, more work.

In the demo you'll see IVF hit recall 1.0 while scanning ~6% of the data (nprobe=8 of nlist=128). That's the whole point: near-exact answers at a fraction of the cost.

What IVF does not do

IVF keeps the full vectors — it saves time, not memory. A billion vectors still need terabytes. To shrink them, we need quantization, which is next. 👉