Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is, counterintuitively, to train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust than small models to compression techniques such as quantization and pruning. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
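The recipe the abstract describes, train a large Transformer, stop early, then compress it for inference, can be illustrated with a minimal sketch, assuming PyTorch. The model dimensions, sparsity level, and choice of magnitude pruning plus post-training dynamic quantization below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A "large" Transformer encoder (hypothetical dimensions for illustration).
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16, dim_feedforward=4096)
model = nn.TransformerEncoder(layer, num_layers=24)

# ... train for a small number of iterations, then stop early ...

# 1) Magnitude pruning: zero out the smallest-magnitude weights in each
#    linear layer (60% sparsity here is an illustrative setting).
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)
        prune.remove(module, "weight")  # make the pruning permanent

# 2) Quantization: convert linear-layer weights to 8-bit integers for
#    smaller, faster inference (post-training dynamic quantization).
compressed = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```

Under the paper's claim, the heavily compressed large model obtained this way retains higher accuracy than a lightly compressed small model trained under the same compute budget.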