Scalar quantization
Before the powerful Product Quantization, here's the simplest compression of all — scalar quantization (SQ) — which already gives a 4× memory cut and introduces the core idea: store small integers instead of floats.
The idea
A 32-bit float takes 4 bytes. But embedding values usually live in a modest range (say roughly $-1$ to $1$). If we map that range onto the 256 integers $0..255$, we can store each number in a single byte — a 4× reduction — losing only a little precision.
For each dimension we find its min and max across the dataset, then:
$$ \text{code} = \mathrm{round}\!\left( 255 \cdot \frac{x - \min}{\max - \min} \right), \qquad x \approx \min + \frac{\text{code}}{255}\,(\max - \min). $$
The first formula compresses a float to a byte; the second reconstructs an approximate float from the byte.
The code
import numpy as np
class ScalarQuantizer:
def train(self, X):
self.lo = X.min(axis=0) # per-dimension min
self.hi = X.max(axis=0) # per-dimension max
self.scale = np.maximum(self.hi - self.lo, 1e-9)
return self
def encode(self, X): # floats -> bytes
u = (X - self.lo) / self.scale # to 0..1
return np.clip(np.round(u * 255), 0, 255).astype(np.uint8)
def decode(self, codes): # bytes -> approx floats
return self.lo + (codes.astype(np.float64) / 255.0) * self.scale
See it work
import numpy as np
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 8))
sq = ScalarQuantizer().train(X)
codes = sq.encode(X)
recon = sq.decode(codes)
print("original :", np.round(X[0], 3).tolist())
print("codes(byte):", codes[0].tolist())
print("decoded :", np.round(recon[0], 3).tolist())
print("bytes/vec : float32", X.shape[1] * 4, "-> SQ", codes.shape[1],
f"({X.shape[1] * 4 // codes.shape[1]}x smaller)")
print("avg abs error:", round(float(np.abs(X - recon).mean()), 4))
Output:
original : [0.274, -0.46, -0.918, -0.967, 0.627, 0.826, 0.213, 0.459]
codes(byte): [162, 69, 10, 4, 207, 233, 155, 186]
decoded : [0.273, -0.459, -0.917, -0.968, 0.624, 0.823, 0.215, 0.457]
bytes/vec : float32 32 -> SQ 8 (4x smaller)
avg abs error: 0.0019
Each float became one byte (4× smaller), and the reconstruction is off by only
~0.002 — usually negligible for nearest-neighbor ranking. (This is FAISS's SQ8.)
Why we need more than SQ
SQ is great and widely used (FAISS calls it SQ8), but it tops out at ~4×
because it still keeps one byte per dimension. For a 768-dim vector that's
768 bytes — better than 3072, but a billion of them is still ~768 GB.
To go further — 16× to 64× — we must stop treating dimensions independently and compress groups of them together. That's Product Quantization, the heart of this book. 👉