TurboQuant: Google's New Compression Tricks That Actually Work

Google Research just dropped three new compression algorithms — TurboQuant, Quantized Johnson-Lindenstrauss (QJL), and PolarQuant — and honestly, they’re more interesting than most of the quantization papers I’ve read this year.

Vectors are the backbone of how AI models think. Small ones represent simple stuff like coordinates, while high-dimensional ones capture complex things like image features or word meanings. The problem? High-dimensional vectors eat memory for breakfast. They’re the main reason your key-value cache gets bloated and your attention mechanism slows to a crawl.

Vector quantization has been the go-to fix for this, but it has a dirty secret: traditional methods add their own memory overhead. Most quantizers need to calculate and store full-precision constants for every tiny block of data. That’s 1 or 2 extra bits per number, which partially defeats the purpose of compressing in the first place.
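
To put a number on that overhead, here’s a back-of-the-envelope calculation. The block sizes and the 16-bit scale and zero point are my own illustrative assumptions, not figures from the paper.

```python
def effective_bits_per_value(payload_bits: int, block_size: int, constant_bits: int) -> float:
    """Average storage per number once the per-block constants are counted."""
    return payload_bits + constant_bits / block_size

# Assumed setup: 4-bit codes, plus one 16-bit scale and one 16-bit zero point per block.
blocks_of_32 = effective_bits_per_value(payload_bits=4, block_size=32, constant_bits=16 + 16)
blocks_of_16 = effective_bits_per_value(payload_bits=4, block_size=16, constant_bits=16 + 16)

print(blocks_of_32)  # 5.0 -> 1 extra bit per number
print(blocks_of_16)  # 6.0 -> 2 extra bits per number
```

Smaller blocks track the data better but pay for more constants, which is exactly the trade-off these new methods try to sidestep.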

TurboQuant, which is being presented at ICLR 2026, tackles this head-on. It’s a compression method that claims to reduce model size with zero accuracy loss. In testing, it showed real promise for unclogging KV-cache bottlenecks without sacrificing performance.

How the sausage is made

TurboQuant works in two stages. First, it randomly rotates the data vectors — a clever geometric trick that simplifies the data’s shape, making it easier to apply a standard quantizer to each part. This first stage uses most of the bits to capture the main signal.
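
Here’s a minimal NumPy sketch of that first stage, under some loud assumptions: I use a dense random orthogonal matrix for the rotation and a plain uniform grid as the "standard quantizer". The function names, the 4-bit setting, and the clipping range are all mine, and the real method may use a different rotation and quantizer.

```python
import numpy as np

def random_rotation(dim, rng):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Fixing column signs with diag(r) makes the draw uniform over orthogonal matrices.
    return q * np.sign(np.diag(r))

def quantize(x, bits, limit):
    """Map each value in [-limit, limit] to an integer code on a fixed uniform grid."""
    levels = 2 ** bits
    step = 2 * limit / levels
    return np.clip(np.floor((x + limit) / step), 0, levels - 1).astype(np.int64)

def dequantize(codes, bits, limit):
    """Return the midpoint of each code's grid cell."""
    step = 2 * limit / 2 ** bits
    return (codes + 0.5) * step - limit

rng = np.random.default_rng(0)
dim, bits = 64, 4
x = rng.standard_normal(dim)
x[0] = 25.0                    # one big outlier coordinate
x /= np.linalg.norm(x)         # work with a unit vector

rot = random_rotation(dim, rng)
y = rot @ x                    # the rotation spreads the outlier across all coordinates
limit = 4 / np.sqrt(dim)       # assumed fixed range: coordinates of y have std ~ 1/sqrt(dim)
codes = quantize(y, bits, limit)
x_hat = rot.T @ dequantize(codes, bits, limit)   # undo the rotation after dequantizing
error = np.linalg.norm(x - x_hat)
print(error)
```

The point of the sketch is that after the rotation every coordinate lives on roughly the same scale, so one fixed grid works for all of them and there is no per-block constant to store.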

Then comes the interesting part. TurboQuant takes the leftover error — the tiny residual from the first compression — and throws just 1 bit at it using the QJL algorithm. This acts as a mathematical error-checker that eliminates bias, leading to more accurate attention scores.
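
To see why that extra bit removes bias, it helps to split the attention score into two pieces. The notation is mine, not the paper’s: k is a key, q is a query, k-hat is the stage-one reconstruction, r is the residual, and the hatted inner product is the 1-bit estimate, which I’m assuming is unbiased over its own randomness.

```latex
\begin{aligned}
r &= k - \hat{k}, \\
\langle q, k \rangle &= \langle q, \hat{k} \rangle + \langle q, r \rangle, \\
\mathbb{E}\!\left[ \langle q, \hat{k} \rangle + \widehat{\langle q, r \rangle}_{\mathrm{QJL}} \right]
  &= \langle q, \hat{k} \rangle + \langle q, r \rangle = \langle q, k \rangle .
\end{aligned}
```

In words: whatever the first stage gets wrong ends up in r, and an unbiased estimate of the residual term makes the overall score correct on average.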

QJL: One bit to rule them all

QJL uses the Johnson-Lindenstrauss Transform to shrink high-dimensional data while preserving the essential distances between points. It then reduces each number in the resulting vector to a single sign bit (+1 or -1), with no per-block constants to store. Zero memory overhead. To maintain accuracy, it uses a special estimator that pairs a high-precision query with the low-precision data. It’s elegant in its simplicity.
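
Here’s a rough sketch of how a sign-bit estimator like this can work, assuming a Gaussian projection matrix. The function names are hypothetical, and keeping the key’s norm as one extra scalar per vector is my assumption about how to rescale the estimate. The scaling factor sqrt(pi/2) comes from the fact that the expected absolute value of a standard Gaussian is sqrt(2/pi).

```python
import numpy as np

def qjl_encode(key, proj):
    """Keep only the sign of each random projection, plus the key's norm."""
    signs = np.where(proj @ key >= 0, 1, -1).astype(np.int8)
    return signs, float(np.linalg.norm(key))

def qjl_inner_product(query, signs, key_norm, proj):
    """Estimate <query, key> from a full-precision query and the key's sign bits."""
    m = proj.shape[0]
    return float(np.sqrt(np.pi / 2) * key_norm / m * ((proj @ query) @ signs))

rng = np.random.default_rng(0)
dim, m = 64, 8192              # many projections here, only to make convergence visible
proj = rng.standard_normal((m, dim))
key = rng.standard_normal(dim)
query = key + 0.5 * rng.standard_normal(dim)   # correlated with the key on purpose

signs, key_norm = qjl_encode(key, proj)
estimate = qjl_inner_product(query, signs, key_norm, proj)
exact = float(query @ key)
print(exact, estimate)
```

Note the asymmetry: the query is projected at full precision while the key is crushed to signs. That asymmetry is what keeps the estimate unbiased in this sketch.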

PolarQuant: A different angle

PolarQuant takes a completely different approach to the memory overhead problem. Instead of representing vectors using standard Cartesian coordinates (X, Y, Z), it converts them into polar coordinates: a magnitude plus angles. Angles always fall in a fixed, known range, so they can be quantized on a fixed grid without storing a separate scaling constant for every block, and that is where the memory savings come from.
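
To make the polar idea concrete, here’s a simplified sketch that pairs up coordinates, converts each pair to a radius and an angle, and quantizes only the angle. This is my own toy version, not the published algorithm: the pairing scheme, the 4-bit setting, and leaving the radii at full precision are all assumptions.

```python
import numpy as np

def to_polar_pairs(x):
    """Treat consecutive coordinates as (x, y) pairs; x must have even length."""
    pairs = x.reshape(-1, 2)
    radius = np.hypot(pairs[:, 0], pairs[:, 1])
    angle = np.arctan2(pairs[:, 1], pairs[:, 0])   # always within [-pi, pi]
    return radius, angle

def quantize_angle(angle, bits):
    """Uniform grid over the full circle; no per-block scale is needed."""
    levels = 2 ** bits
    step = 2 * np.pi / levels
    return np.floor((angle + np.pi) / step).astype(np.int64) % levels

def dequantize_angle(codes, bits):
    """Return the midpoint of each angular cell."""
    step = 2 * np.pi / 2 ** bits
    return (codes + 0.5) * step - np.pi

def from_polar_pairs(radius, angle):
    """Convert (radius, angle) pairs back to a flat Cartesian vector."""
    return np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1).reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
radius, angle = to_polar_pairs(x)
codes = quantize_angle(angle, bits=4)
x_hat = from_polar_pairs(radius, dequantize_angle(codes, bits=4))
rel_error = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(rel_error)
```

Because the grid covers the whole circle no matter what the data looks like, the quantizer has nothing to calibrate per block. In this toy version the relative error is bounded by half the angular step.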

How these pieces fit together is what makes TurboQuant interesting. PolarQuant handles the heavy lifting, QJL cleans up the residuals, and the result is a compression pipeline that actually works without the usual trade-offs.

What this means in practice

For anyone working with large language models or vector search engines, this is worth paying attention to. The KV-cache bottleneck is one of the most annoying practical problems in serving large models. Every bit of compression you can squeeze out without accuracy loss translates directly to lower memory costs and faster inference.
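
For a sense of scale, here’s the standard KV-cache size arithmetic. The model shape and the 3-bit setting below are hypothetical numbers I picked for illustration, not results from the paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value, batch=1):
    """Size of the KV cache: one key and one value vector per layer, head, and token."""
    values = 2 * layers * kv_heads * head_dim * seq_len * batch   # 2 = keys + values
    return values * bits_per_value / 8

GIB = 2 ** 30
# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 32768-token context.
fp16_cache = kv_cache_bytes(32, 8, 128, 32768, bits_per_value=16)
low_bit_cache = kv_cache_bytes(32, 8, 128, 32768, bits_per_value=3)

print(fp16_cache / GIB)      # 4.0
print(low_bit_cache / GIB)   # 0.75
```

That’s per sequence, so the savings multiply with batch size.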

I’ve seen a lot of quantization papers that look good on paper but fall apart in real deployments. The fact that Google is presenting this at ICLR 2026 and AISTATS 2026 suggests they’ve done their homework. The zero-overhead claim for QJL is particularly appealing — most “lossless” compression techniques have hidden costs.

That said, I’m curious to see how these algorithms perform on real production workloads, not just benchmark datasets. The math is sound, but implementation details always matter. I’ll be keeping an eye on any open-source releases or follow-up papers that show actual deployment numbers.

For now, this is one of the more practical contributions to model compression I’ve seen in a while. No hype, no vague promises — just solid math and a clear explanation of what they solved.
