Google’s TurboQuant algorithm compresses the KV cache (the per-token attention state that dominates GPU memory during LLM inference, growing with every token of context) by 6x, with no accuracy trade-off and no model retraining required. It operates near the lower bound that information theory says is achievable at all.
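For a feel of the mechanics, here is a minimal sketch of the generic rotate-then-quantize recipe that low-bit KV-cache schemes of this kind build on. To be clear, this is an illustration under my own assumptions, not TurboQuant's actual algorithm; every function name and parameter below is made up for the example.

```python
# Sketch of the generic rotate-then-quantize recipe behind low-bit
# KV-cache compression. NOT TurboQuant's exact algorithm; all names
# and parameters here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d: int) -> np.ndarray:
    """QR of a Gaussian matrix gives an (approximately) random rotation."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize(x: np.ndarray, bits: int):
    """Uniform scalar quantizer with one float scale per token vector."""
    half = 2 ** (bits - 1)
    scale = np.abs(x).max(axis=-1, keepdims=True) / (half - 0.5)
    codes = np.clip(np.round(x / scale), -half, half - 1)
    return codes.astype(np.int8), scale  # real systems bit-pack the codes

def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scale

d = 128                                   # attention head dimension
R = random_rotation(d)
kv = rng.standard_normal((4096, d)).astype(np.float32)  # stand-in KV rows

codes, scale = quantize(kv @ R, bits=3)   # rotate, quantize to 3 bits
recon = dequantize(codes, scale) @ R.T    # dequantize, rotate back

rel_err = np.linalg.norm(recon - kv) / np.linalg.norm(kv)
print(f"relative reconstruction error: {rel_err:.3f}")
print(f"bits per value: 3 (vs. 16 for fp16, ~{16/3:.1f}x smaller)")
```

On this synthetic Gaussian data the rotation is statistically a no-op; its job on real KV tensors is to smear outlier channels across all coordinates so a uniform quantizer doesn't waste its range. Note the arithmetic behind the headline number, too: 6x smaller than fp16 works out to roughly 2.7 bits per stored value, which is exactly why operating near the information-theoretic distortion bound matters.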
This is not incremental. If adopted at scale, the economics of AI inference flip.
adlrocha’s writeup is the clearest explanation I’ve seen of why this matters.