Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
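Below is a minimal sketch of what sparse upcycling could look like in code. It assumes, beyond what the abstract states, that each Mixture-of-Experts expert is initialized as a copy of the dense checkpoint's MLP weights while the router is freshly (randomly) initialized and all other parameters are copied over; the names `dense_ckpt`, `moe_layer_names`, and `num_experts` are illustrative, not an actual API.

```python
# Illustrative sketch of sparse upcycling: build a sparse MoE checkpoint
# from a dense one by copying dense MLP weights into every expert and
# adding a freshly initialized router. All names here are hypothetical.
import numpy as np

def upcycle_mlp_to_moe(dense_mlp_params, num_experts, d_model, rng):
    """Build MoE-layer parameters from one dense MLP block's parameters."""
    return {
        # Every expert starts from the same trained dense MLP weights.
        "experts": [
            {k: v.copy() for k, v in dense_mlp_params.items()}
            for _ in range(num_experts)
        ],
        # The router has no dense counterpart, so it is initialized from scratch.
        "router": rng.normal(scale=0.02, size=(d_model, num_experts)),
    }

def sparse_upcycle(dense_ckpt, moe_layer_names, num_experts, d_model, seed=0):
    """Copy all dense parameters; replace selected MLP blocks with MoE blocks."""
    rng = np.random.default_rng(seed)
    sparse_ckpt = {}
    for name, params in dense_ckpt.items():
        if name in moe_layer_names:
            sparse_ckpt[name] = upcycle_mlp_to_moe(params, num_experts, d_model, rng)
        else:
            sparse_ckpt[name] = {k: v.copy() for k, v in params.items()}
    return sparse_ckpt

# Toy usage: a two-layer "checkpoint" where layer_1's MLP becomes an 8-expert MoE.
d_model, d_ff = 16, 64
dense_ckpt = {
    f"layer_{i}": {
        "wi": np.zeros((d_model, d_ff)),  # stand-ins for trained weights
        "wo": np.zeros((d_ff, d_model)),
    }
    for i in range(2)
}
sparse_ckpt = sparse_upcycle(dense_ckpt, {"layer_1"}, num_experts=8, d_model=d_model)
```

The upcycled model then resumes training from this initialization rather than from scratch, which is how the approach reuses the sunk dense pretraining cost.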