Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical, low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP, etc.). For example, NestQuant quantizes the weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving a perplexity of 6.6 on wikitext2. This represents a reduction of more than 55% in the perplexity gap with respect to the unquantized model (perplexity of 6.14), compared to the state-of-the-art Meta SpinQuant (perplexity 7.3), OstQuant (7.3), and QuaRot (8.2). Comparisons on larger models (up to 70B) and on various LLM evaluation benchmarks confirm the uniform superiority of NestQuant.
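Since the abstract only names the construction, the following is a minimal, self-contained sketch of what Gosset-lattice (E8) rounding with a simple nested (Voronoi) coding step could look like for an 8-element weight block. The scale `beta`, nesting ratio `q`, and the function names `nested_encode`/`nested_decode` are illustrative assumptions, not the paper's actual implementation; only the D8/E8 nearest-point rule (Conway and Sloane) and the standard mod-coarse-lattice reduction are taken as given.

```python
import numpy as np

def quantize_D8(x):
    """Nearest point of D8 = {z in Z^8 : sum(z) even} (Conway & Sloane rule)."""
    f = np.round(x)
    if int(np.sum(f)) % 2 != 0:
        # Re-round the coordinate with the largest rounding error to its
        # second-nearest integer, which flips the parity of the sum.
        i = int(np.argmax(np.abs(x - f)))
        f[i] += 1.0 if x[i] >= f[i] else -1.0
    return f

def quantize_E8(x):
    """Nearest point of the Gosset lattice E8 = D8 ∪ (D8 + 1/2)."""
    c0 = quantize_D8(x)
    c1 = quantize_D8(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

def nested_encode(x, beta=0.1, q=16):
    """Toy nested-lattice (Voronoi) encoder for an 8-dim block.

    beta -- fine-lattice scale (illustrative choice, not from the paper)
    q    -- nesting ratio: the coarse lattice is q * (beta * E8)
    Returns the coset representative of the fine-lattice point modulo the
    coarse lattice, i.e., the finite index that would actually be stored.
    """
    y = quantize_E8(x / beta)            # nearest fine-lattice point (lattice coords)
    y_mod = y - q * quantize_E8(y / q)   # reduce modulo the coarse lattice q*E8
    return y_mod

def nested_decode(y_mod, beta=0.1):
    """Naive reconstruction, valid only when no overload occurred."""
    return beta * y_mod

# Toy usage: quantize one random 8-weight block and measure the error.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=8)
w_hat = nested_decode(nested_encode(w))
print("mse:", float(np.mean((w - w_hat) ** 2)))
```

In this sketch the "self-similar" nesting is simply that the coarse lattice is a scaled copy (q times) of the fine lattice, so the codebook is finite with roughly q^8 codewords per 8-dimensional block; a practical quantizer would additionally handle overload, scaling, and per-matrix calibration, which are beyond what the abstract specifies.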