Quantization plays a crucial role in accelerating the inference of large-scale models, and rotation matrices have been shown to effectively improve quantization performance by smoothing outliers. However, rotational optimization algorithms that rely on end-to-end fine-tuning incur high computational costs and are prone to overfitting. To address this challenge, we propose DartQuant, an efficient distribution-aware rotational calibration method that reduces the complexity of rotational optimization by constraining the distribution of the activations after rotation. This approach also reduces reliance on task-specific losses, thereby mitigating the risk of overfitting. Additionally, we introduce the QR-Orth optimization scheme, which replaces expensive alternating optimization with a more efficient solution. Across a variety of model quantization experiments, DartQuant demonstrates superior performance: compared to existing methods, it achieves a 47$\times$ speedup and 10$\times$ memory savings for rotational optimization on a 70B model. Furthermore, it is the first method to complete rotational calibration for a 70B model on a single RTX 3090 GPU, making quantization of large language models feasible in resource-constrained environments. Code is available at https://github.com/CAS-CLab/DartQuant.git.
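To make the two ideas named above concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the paper's actual implementation: an orthogonal rotation kept valid by taking the Q factor of a QR decomposition of an unconstrained matrix (one plausible reading of a QR-based orthogonal parametrization), and a distribution-aware calibration loss that penalizes heavy-tailed, outlier-dominated channels after rotation via excess kurtosis. The class and function names, the kurtosis-based loss, and the synthetic activations are all assumptions made for illustration; DartQuant's exact objective and parametrization are defined in the paper, not in this abstract.

```python
# Minimal sketch (illustrative assumptions, not DartQuant's exact method):
# an orthogonal rotation parametrized via QR decomposition, calibrated with a
# distribution-based loss instead of an end-to-end task loss.
import torch


class QROrthRotation(torch.nn.Module):
    """Rotation kept orthogonal by re-orthogonalizing an unconstrained matrix."""

    def __init__(self, dim: int):
        super().__init__()
        # Unconstrained parameter; only its Q factor is applied to activations.
        self.raw = torch.nn.Parameter(torch.eye(dim) + 0.01 * torch.randn(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, r = torch.linalg.qr(self.raw)  # q is orthogonal by construction
        # Fix column signs (diag(R) > 0) so the decomposition is unique and stable.
        q = q * torch.sign(torch.diagonal(r)).unsqueeze(0)
        return x @ q  # rotate activations


def distribution_loss(x: torch.Tensor) -> torch.Tensor:
    """Illustrative distribution constraint: penalize per-channel excess kurtosis,
    pushing rotated activations toward a light-tailed, outlier-free shape."""
    mu = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, keepdim=True) + 1e-6
    kurt = ((x - mu) ** 4).mean(dim=0, keepdim=True) / var**2
    return (kurt - 3.0).abs().mean()  # 3.0 is the kurtosis of a Gaussian


# Toy calibration loop over a fixed batch of synthetic heavy-tailed activations,
# standing in for activations cached from a real model (no task loss is used).
dim = 512
acts = torch.randn(4096, dim) ** 3  # heavy-tailed, outlier-prone channels
rot = QROrthRotation(dim)
opt = torch.optim.Adam(rot.parameters(), lr=1e-3)
for step in range(200):
    loss = distribution_loss(rot(acts))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final distribution loss: {loss.item():.3f}")
```

Because the loss depends only on the shape of the rotated activation distribution, this kind of calibration avoids backpropagating through the full model, which is one plausible reason a distribution-aware objective can be far cheaper than end-to-end fine-tuning of the rotation.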