There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost that has both financial and environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. present a comprehensive study of the scaling behaviour of Transformer language models, their scope is limited to the upstream (pretraining) loss. It therefore remains unclear whether these findings transfer to downstream tasks within the context of the pretrain-finetune paradigm. The key findings of this paper are as follows: (1) we show that, aside from model size alone, model shape matters for downstream fine-tuning; (2) scaling protocols operate differently at different compute regions; (3) the widely adopted T5-base and T5-large sizes are Pareto-inefficient. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50\% fewer parameters and training 40\% faster compared to the widely adopted T5-base model. We publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.