Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
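Below is a minimal sketch of what sparse upcycling could look like in code. It assumes, beyond what the abstract states, that each Mixture-of-Experts expert is initialized as a copy of the dense checkpoint's MLP weights while the router is freshly (randomly) initialized and all other parameters are copied over; the names `dense_ckpt`, `moe_layer_names`, and `num_experts` are illustrative, not an actual API.

```python
# Illustrative sketch of sparse upcycling: build a sparse MoE checkpoint
# from a dense one by copying dense MLP weights into every expert and
# adding a freshly initialized router. All names here are hypothetical.
import numpy as np

def upcycle_mlp_to_moe(dense_mlp_params, num_experts, d_model, rng):
    """Build MoE-layer parameters from one dense MLP block's parameters."""
    return {
        # Every expert starts from the same trained dense MLP weights.
        "experts": [
            {k: v.copy() for k, v in dense_mlp_params.items()}
            for _ in range(num_experts)
        ],
        # The router has no dense counterpart, so it is initialized from scratch.
        "router": rng.normal(scale=0.02, size=(d_model, num_experts)),
    }

def sparse_upcycle(dense_ckpt, moe_layer_names, num_experts, d_model, seed=0):
    """Copy all dense parameters; replace selected MLP blocks with MoE blocks."""
    rng = np.random.default_rng(seed)
    sparse_ckpt = {}
    for name, params in dense_ckpt.items():
        if name in moe_layer_names:
            sparse_ckpt[name] = upcycle_mlp_to_moe(params, num_experts, d_model, rng)
        else:
            sparse_ckpt[name] = {k: v.copy() for k, v in params.items()}
    return sparse_ckpt

# Toy usage: a two-layer "checkpoint" where layer_1's MLP becomes an 8-expert MoE.
d_model, d_ff = 16, 64
dense_ckpt = {
    f"layer_{i}": {
        "wi": np.zeros((d_model, d_ff)),  # stand-ins for trained weights
        "wo": np.zeros((d_ff, d_model)),
    }
    for i in range(2)
}
sparse_ckpt = sparse_upcycle(dense_ckpt, {"layer_1"}, num_experts=8, d_model=d_model)
```

The upcycled model then resumes training from this initialization rather than from scratch, which is how the approach reuses the sunk dense pretraining cost.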