Efficiently serving ever-larger trained natural language models has become exceptionally challenging in practice, even for powerful cloud servers, because of their prohibitive memory and computation requirements. In this work, we present ZeroQuant, an efficient and affordable post-training quantization approach for compressing large Transformer-based models. ZeroQuant is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained, hardware-friendly quantization scheme for both weights and activations; (2) a novel, affordable layer-by-layer knowledge distillation algorithm (LKD) that works even without access to the original training data; and (3) a highly optimized quantization system backend that removes the quantization/dequantization overhead. With these components, we show that: (1) ZeroQuant can reduce the precision of weights and activations to INT8 in a cost-free way for both BERT and GPT-3-style models with minimal accuracy impact, which leads to up to 5.19x/4.16x speedup on those models, respectively, compared to FP16 inference; (2) ZeroQuant plus LKD affordably quantizes the weights in the fully-connected module to INT4, along with INT8 weights in the attention module and INT8 activations, resulting in a 3x memory footprint reduction compared to the FP16 model; (3) ZeroQuant can be directly applied to two of the largest open-sourced language models, GPT-J6B and GPT-NeoX20B, for which our INT8 model achieves accuracy similar to that of the FP16 model with up to 5.2x better efficiency.
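To make the quantization scheme and the 3x memory figure more concrete, the following is a minimal NumPy sketch, not the ZeroQuant implementation. The abstract only says the scheme is "fine-grained"; the choice of one scale per token for activations and one scale per output row for weights is an assumption made here for illustration, as are the function names `quantize_symmetric`, `quantized_linear`, and `layer_weight_bytes`. The footprint helper further assumes a standard Transformer layer with four h-by-h attention projections and two h-by-4h fully-connected matrices, and ignores embeddings, biases, and normalization parameters.

```python
import numpy as np


def quantize_symmetric(x, axis, num_bits=8):
    """Symmetric quantization with one scale per slice along `axis` (illustrative)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    # Avoid division by zero for all-zero slices.
    scale = np.where(max_abs == 0, 1.0, max_abs / qmax)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def quantized_linear(x, w):
    """Simulate an INT8 linear layer.

    x: (tokens, in_features) activations, quantized with one scale per token.
    w: (out_features, in_features) weights, quantized with one scale per row.
    Both granularities are assumptions of this sketch.
    """
    xq, sx = quantize_symmetric(x, axis=1)  # sx has shape (tokens, 1)
    wq, sw = quantize_symmetric(w, axis=1)  # sw has shape (out_features, 1)
    # Integer matmul with INT32 accumulation, then rescale back to floating point.
    acc = xq.astype(np.int32) @ wq.astype(np.int32).T
    return acc * sx * sw.T


def layer_weight_bytes(hidden, attn_bits, ffn_bits):
    """Weight storage of one Transformer layer under the assumptions stated above."""
    attn_params = 4 * hidden * hidden  # Q, K, V, and output projections
    ffn_params = 8 * hidden * hidden   # two hidden x 4*hidden matrices
    return (attn_params * attn_bits + ffn_params * ffn_bits) / 8
```

Under these assumptions, `layer_weight_bytes(h, 16, 16) / layer_weight_bytes(h, 8, 4)` evaluates to 3 for any hidden size `h`, which is consistent with the 3x reduction quoted for INT4 fully-connected weights combined with INT8 attention weights. A real deployment would fuse the quantize and rescale steps into optimized kernels rather than run them as separate array operations.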