Network quantization has gained increasing attention with the rapid growth of large pre-trained language models~(PLMs). However, most existing quantization methods for PLMs follow quantization-aware training~(QAT), which requires end-to-end training with full access to the entire training set. Consequently, they suffer from slow training, large memory overhead, and data security issues. In this paper, we study post-training quantization~(PTQ) of PLMs and propose module-wise reconstruction error minimization~(MREM), an efficient solution that mitigates these issues. By partitioning the PLM into multiple modules, we minimize the reconstruction error incurred by quantization for each module. In addition, we design a new model-parallel training strategy such that each module can be trained locally on a separate computing device without waiting for preceding modules, which yields nearly the theoretical training speed-up (e.g., $4\times$ on $4$ GPUs). Experiments on the GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
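To make the idea concrete, the following is a minimal sketch of module-wise reconstruction error minimization in a PyTorch setting; the function `tune_quantized_module`, the module partitioning, and the calibration inputs are illustrative placeholders under assumed interfaces, not the paper's actual implementation. Each quantized module is tuned so that its output matches the output of its full-precision counterpart on a small calibration set.

```python
import torch
import torch.nn as nn

def tune_quantized_module(fp_module: nn.Module,
                          q_module: nn.Module,
                          calib_inputs,        # list of cached input activation tensors (assumed)
                          lr: float = 1e-4,
                          steps: int = 100) -> None:
    """Minimize the reconstruction error of one quantized module against its
    full-precision counterpart on a small calibration set (illustrative sketch)."""
    fp_module.eval()
    optimizer = torch.optim.Adam(q_module.parameters(), lr=lr)
    mse = nn.MSELoss()
    for step in range(steps):
        x = calib_inputs[step % len(calib_inputs)]
        with torch.no_grad():
            target = fp_module(x)      # full-precision module output as the reference
        output = q_module(x)           # output of the quantized module
        loss = mse(output, target)     # module-wise reconstruction error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because each module's inputs and targets can be taken from cached full-precision activations, the per-module optimizations are mutually independent, which is one plausible way to realize the model-parallel training and near-linear speed-up described above.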