Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.
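To make the setup concrete, the sketch below illustrates masked-language-model pretraining under a fixed wall-clock budget on a single GPU, written in plain PyTorch. It is only an illustrative sketch under assumptions, not the paper's modified pipeline: the tiny model dimensions, the 15% masking rate, the AdamW settings, and the random stand-in corpus are placeholders chosen for brevity.

```python
# Illustrative sketch (not the authors' pipeline): BERT-style masked-language-model
# pretraining that stops when a wall-clock budget is exhausted. The "corpus" is
# random token ids purely for demonstration; all hyperparameters are assumptions.
import time
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, MASK_ID = 32768, 128, 4
BUDGET_SECONDS = 24 * 3600  # the "one day" compute budget (shrink for a quick test)

class TinyMaskedLM(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.embed(ids) + self.pos(pos)
        return self.head(self.encoder(h))

def mask_tokens(ids, mask_prob=0.15):
    """Hide a fraction of tokens; the loss is computed only on the hidden positions."""
    labels = ids.clone()
    masked = torch.rand_like(ids, dtype=torch.float) < mask_prob
    labels[~masked] = -100            # ignored by cross-entropy below
    inputs = ids.clone()
    inputs[masked] = MASK_ID
    return inputs, labels

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyMaskedLM().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

start = time.time()
while time.time() - start < BUDGET_SECONDS:      # train until the budget is spent
    batch = torch.randint(5, VOCAB, (64, SEQ_LEN), device=device)  # stand-in data
    inputs, labels = mask_tokens(batch)
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, VOCAB), labels.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In a realistic run, the random batches would be replaced by a tokenized text corpus and the model scaled up as far as the budget allows; the point of the sketch is simply that the stopping criterion is elapsed time rather than a fixed number of tokens or steps.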