The number of parameters in state-of-the-art neural networks has drastically increased in recent years. This surge of interest in large-scale neural networks has motivated the development of new distributed training strategies capable of supporting such models. One such strategy is model-parallel distributed training. Unfortunately, model-parallelism can suffer from poor resource utilisation, which leads to wasted compute. In this work, we improve upon recent developments in an idealised model-parallel optimisation setting: local learning. Motivated by poor resource utilisation in the global setting and poor task performance in the local setting, we introduce a class of intermediary strategies between local and global learning referred to as interlocking backpropagation. These strategies preserve many of the compute-efficiency advantages of local optimisation while recovering much of the task performance achieved by global optimisation. We assess our strategies on both image-classification ResNets and Transformer language models, finding that they consistently outperform local learning in task performance and outperform global learning in training efficiency.
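The abstract describes the local/global spectrum only at a high level; the snippet below is a minimal conceptual sketch, not the paper's implementation or its exact interlocking schedule. It assumes a network split into stage modules, each with a small auxiliary classifier, and a hypothetical window size k that limits how many stages each auxiliary loss backpropagates through: k = 1 recovers purely local learning, k equal to the number of stages approximates global gradient flow, and intermediate values illustrate the kind of compromise the abstract refers to. The names Stage, aux_head, and windowed_backprop_step are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One model-parallel stage with a small auxiliary classifier (illustrative)."""
    def __init__(self, dim: int, n_classes: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.aux_head = nn.Linear(dim, n_classes)

    def forward(self, x):
        h = self.body(x)
        return h, self.aux_head(h)

def windowed_backprop_step(stages, x, y, k=2):
    """Backpropagate each stage's auxiliary loss through at most its last k stages.

    k = 1 trains each stage in isolation (local learning); k = len(stages) lets
    every auxiliary loss reach the whole network, approximating global gradient
    flow; intermediate k values limit how far gradients travel.
    """
    criterion = nn.CrossEntropyLoss()
    # Plain forward pass to record each stage's input (no gradients needed here).
    inputs, h = [], x
    with torch.no_grad():
        for stage in stages:
            inputs.append(h)
            h, _ = stage(h)
    # Recompute the last k stages with gradients for each auxiliary loss; the
    # overlapping windows mean a stage can receive gradients from up to k losses.
    total = 0.0
    for i in range(len(stages)):
        start = max(0, i - k + 1)
        h = inputs[start]
        for j in range(start, i + 1):
            h, logits = stages[j](h)
        loss = criterion(logits, y)
        loss.backward()
        total += loss.item()
    return total

# Example usage on toy data (hypothetical shapes).
stages = nn.ModuleList([Stage(dim=32, n_classes=10) for _ in range(4)])
opt = torch.optim.SGD(stages.parameters(), lr=0.1)
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
opt.zero_grad()
print(windowed_backprop_step(stages, x, y, k=2))
opt.step()
```

In this sketch the recomputation of the last k stages trades extra forward compute for bounded gradient paths; the paper's actual strategies are evaluated on ResNets and Transformers and may organise this scheduling differently.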