NestQuant: Nested Lattice Quantization for Matrix Products and LLMs

Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent work has shown mathematically that such quantizers are information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP, etc.). For example, NestQuant quantizes the weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving a perplexity of 6.6 on wikitext2. This is a reduction of more than 55% in the perplexity gap to the unquantized model (perplexity 6.14) relative to the state-of-the-art methods Meta's SpinQuant (perplexity 7.3), OstQuant (7.3), and QuaRot (8.2). Comparisons on larger models (up to 70B) and on various LLM evaluation benchmarks confirm the uniform superiority of NestQuant.
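To make the idea of a self-similar nested lattice quantizer on the Gosset lattice concrete, the following is a minimal sketch, not the NestQuant implementation. It assumes the standard description of the Gosset lattice E8 as the union of D8 and D8 shifted by 1/2 in every coordinate, and it assumes the nested pair is E8 together with its scaled copy q·E8. The function names and the parameters `q` (nesting ratio) and `beta` (scale) are illustrative choices, not taken from the paper. The mapping of a lattice point to a bit string, which requires a generator matrix, is omitted.

```python
import numpy as np

def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    r = np.rint(x)
    if int(r.sum()) % 2 != 0:
        # Parity is wrong: take the coordinate with the largest rounding
        # error and round it in the other direction.
        i = int(np.argmax(np.abs(x - r)))
        r[i] += 1.0 if x[i] > r[i] else -1.0
    return r

def nearest_e8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2), for an 8-dimensional input."""
    x = np.asarray(x, dtype=float)
    a = nearest_d8(x)                # candidate from the integer coset
    b = nearest_d8(x - 0.5) + 0.5    # candidate from the half-integer coset
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b

def nested_quantize(x, q, beta):
    """Quantize x to the fine lattice beta*E8, then reduce modulo the coarse
    lattice beta*q*E8. Only the coset of the fine point survives, so inputs
    that are too large relative to beta*q are wrapped around (overload)."""
    p = nearest_e8(np.asarray(x, dtype=float) / beta)   # fine lattice point
    p_mod = p - q * nearest_e8(p / q)                   # reduce modulo q*E8
    return beta * p_mod

# Illustrative use on one 8-dimensional block with hypothetical parameters.
x = np.array([0.9, -1.3, 0.2, 2.1, -0.4, 0.0, 1.7, -2.2])
x_hat = nested_quantize(x, q=16, beta=0.5)
```

With a nesting ratio q there are q^8 cosets per 8-dimensional block, that is, log2(q) bits per entry, so q = 16 in this sketch corresponds to 4 bits. The scale `beta` trades granular error against overload: a smaller value gives a finer grid but wraps more inputs.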
