A Closer Look at Codistillation for Distributed Training

Codistillation has been proposed as a mechanism to share knowledge among concurrently trained models by encouraging them to represent the same function through an auxiliary loss. This contrasts with the more commonly used fully-synchronous data-parallel stochastic gradient descent methods, where different model replicas average their gradients (or parameters) at every iteration and thus maintain identical parameters. We investigate codistillation in a distributed training setup, complementing previous work which focused on extremely large batch sizes. Surprisingly, we find that even at moderate batch sizes, models trained with codistillation can perform as well as models trained with synchronous data-parallel methods, despite using a much weaker synchronization mechanism. These findings hold across a range of batch sizes and learning rate schedules, as well as different kinds of models and datasets. Obtaining this level of accuracy, however, requires properly accounting for the regularization effect of codistillation, which we highlight through several empirical observations. Overall, this work contributes to a better understanding of codistillation and how to best take advantage of it in a distributed computing environment.
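The auxiliary loss described above can be illustrated with a minimal sketch. It assumes a classification setting in which each model adds, to its usual cross-entropy on the label, a weighted cross-entropy against the predictions of one peer model. The function names, the `alpha` weight, and the choice of cross-entropy as the distillation term are illustrative assumptions, not details stated in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def codistillation_loss(logits, peer_logits, label, alpha=1.0):
    """Task loss plus a penalty for disagreeing with a peer model.

    `alpha` (hypothetical name) weights the auxiliary term. The peer's
    predictions are treated as a fixed target; in a real training loop no
    gradient would flow through them.
    """
    one_hot = [1.0 if i == label else 0.0 for i in range(len(logits))]
    task_loss = cross_entropy(one_hot, logits)
    peer_probs = softmax(peer_logits)
    distill_loss = cross_entropy(peer_probs, logits)
    return task_loss + alpha * distill_loss


if __name__ == "__main__":
    # The same model output is penalized more when the peer disagrees.
    print(codistillation_loss([2.0, 0.0], [2.0, 0.0], label=0))
    print(codistillation_loss([2.0, 0.0], [0.0, 2.0], label=0))
```

In this sketch the models only need to exchange predictions (or occasionally refreshed copies of each other's parameters) rather than gradients at every iteration, which is the sense in which the synchronization is weaker than in synchronous data-parallel training. Setting `alpha` to zero recovers independent training.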
