Scalar quantization

Before the powerful Product Quantization, here's the simplest compression of all — scalar quantization (SQ) — which already gives a 4× memory cut and introduces the core idea: store small integers instead of floats.

The idea

A 32-bit float takes 4 bytes. But embedding values usually live in a modest range (say roughly $-1$ to $1$). If we map that range onto the 256 integers $0..255$, we can store each number in a single byte — a 4× reduction — losing only a little precision.

For each dimension we find its min and max across the dataset, then:

$$ \text{code} = \mathrm{round}\!\left( 255 \cdot \frac{x - \min}{\max - \min} \right), \qquad x \approx \min + \frac{\text{code}}{255}\,(\max - \min). $$

The first formula compresses a float to a byte; the second reconstructs an approximate float from the byte.

The code

import numpy as np


class ScalarQuantizer:
    def train(self, X):
        self.lo = X.min(axis=0)                 # per-dimension min
        self.hi = X.max(axis=0)                 # per-dimension max
        self.scale = np.maximum(self.hi - self.lo, 1e-9)
        return self

    def encode(self, X):                        # floats -> bytes
        u = (X - self.lo) / self.scale          # to 0..1
        return np.clip(np.round(u * 255), 0, 255).astype(np.uint8)

    def decode(self, codes):                    # bytes -> approx floats
        return self.lo + (codes.astype(np.float64) / 255.0) * self.scale

See it work

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 8))
sq = ScalarQuantizer().train(X)
codes = sq.encode(X)
recon = sq.decode(codes)

print("original  :", np.round(X[0], 3).tolist())
print("codes(byte):", codes[0].tolist())
print("decoded   :", np.round(recon[0], 3).tolist())
print("bytes/vec : float32", X.shape[1] * 4, "-> SQ", codes.shape[1],
      f"({X.shape[1] * 4 // codes.shape[1]}x smaller)")
print("avg abs error:", round(float(np.abs(X - recon).mean()), 4))

Output:

original  : [0.274, -0.46, -0.918, -0.967, 0.627, 0.826, 0.213, 0.459]
codes(byte): [162, 69, 10, 4, 207, 233, 155, 186]
decoded   : [0.273, -0.459, -0.917, -0.968, 0.624, 0.823, 0.215, 0.457]
bytes/vec : float32 32 -> SQ 8 (4x smaller)
avg abs error: 0.0019

Each float became one byte (4× smaller), and the reconstruction is off by only ~0.002 — usually negligible for nearest-neighbor ranking. (This is FAISS's SQ8.)

Why we need more than SQ

SQ is great and widely used (FAISS calls it SQ8), but it tops out at ~4× because it still keeps one byte per dimension. For a 768-dim vector that's 768 bytes — better than 3072, but a billion of them is still ~768 GB.

To go further — 16× to 64× — we must stop treating dimensions independently and compress groups of them together. That's Product Quantization, the heart of this book. 👉