Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is, counterintuitively, to train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust than small models to compression techniques such as quantization and pruning. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
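The recipe the abstract describes, train a large Transformer, stop early, then compress it for inference, can be illustrated with a minimal sketch, assuming PyTorch. The model dimensions, sparsity level, and choice of magnitude pruning plus post-training dynamic quantization below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A "large" Transformer encoder (hypothetical dimensions for illustration).
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16, dim_feedforward=4096)
model = nn.TransformerEncoder(layer, num_layers=24)

# ... train for a small number of iterations, then stop early ...

# 1) Magnitude pruning: zero out the smallest-magnitude weights in each
#    linear layer (60% sparsity here is an illustrative setting).
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)
        prune.remove(module, "weight")  # make the pruning permanent

# 2) Quantization: convert linear-layer weights to 8-bit integers for
#    smaller, faster inference (post-training dynamic quantization).
compressed = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```

Under the paper's claim, the heavily compressed large model obtained this way retains higher accuracy than a lightly compressed small model trained under the same compute budget.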