Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, for LLMs beyond 100 billion parameters, existing methods cannot maintain accuracy or do not run efficiently on hardware. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT-175B, BLOOM-176B, GLM-130B, and MT-NLG 530B. SmoothQuant has better hardware efficiency than existing techniques. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. We integrate SmoothQuant into FasterTransformer, a state-of-the-art LLM serving framework, and achieve faster inference speed with half the number of GPUs compared to FP16, enabling the serving of a 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code is available at https://github.com/mit-han-lab/smoothquant.
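Below is a minimal sketch of the "mathematically equivalent transformation" the abstract refers to: per-input-channel scales divide the activations and are absorbed into the weights, so outlier activation channels become easier to quantize while the product is unchanged. The migration-strength parameter `alpha`, the max-based calibration statistic `act_max`, and the helper name `smooth_linear` are illustrative assumptions, not the released implementation.

```python
# Sketch: migrate quantization difficulty from activations to weights
# via an equivalent per-channel rescaling (assumed formulation).
import torch

@torch.no_grad()
def smooth_linear(act_max: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Return per-input-channel scales s and the smoothed weight.

    act_max: per-channel max of |activations|, shape [in_features],
             assumed to be collected offline on calibration data.
    weight:  linear layer weight, shape [out_features, in_features].
    """
    # s_j = max|X_j|^alpha / max|W_j|^(1-alpha), clamped to avoid zeros.
    w_max = weight.abs().amax(dim=0).clamp(min=1e-5)              # [in_features]
    scales = (act_max.clamp(min=1e-5).pow(alpha)
              / w_max.pow(1.0 - alpha)).clamp(min=1e-5)           # [in_features]
    # Equivalence: (X diag(s)^-1) (diag(s) W^T) == X W^T
    smoothed_weight = weight * scales                             # fold s into W
    return scales, smoothed_weight

if __name__ == "__main__":
    torch.manual_seed(0)
    # Activations with a few outlier channels, as observed in LLMs.
    x = torch.randn(4, 8) * torch.tensor([1., 1., 50., 1., 1., 1., 30., 1.])
    w = torch.randn(16, 8)
    s, w_s = smooth_linear(x.abs().amax(dim=0), w)
    # The smoothed activations X/s have a much flatter dynamic range,
    # while the matmul output is numerically unchanged.
    assert torch.allclose(x @ w.t(), (x / s) @ w_s.t(), atol=1e-3)
```

In practice the division by `s` would not be applied at runtime; it can be folded into the preceding LayerNorm or projection offline, after which ordinary INT8 quantization is applied to both the smoothed activations and weights.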