Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a deep neural network (DNN) model on a single device or with data parallelism. Still, they may not be flexible or efficient enough for training emerging large models on distributed devices, which requires more sophisticated parallelism beyond data parallelism. Plugins and wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Aiming at a simple, clean redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework built on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than existing frameworks do, and the actor model provides a succinct runtime mechanism to manage the complex dependencies imposed by resource constraints, data movement and computation in distributed deep learning. We demonstrate the general applicability and efficiency of OneFlow for training various large DNN models through case studies and extensive experiments. The results show that OneFlow outperforms many well-known customized libraries built on top of state-of-the-art frameworks. The code of OneFlow is available at: https://github.com/Oneflow-Inc/oneflow.
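To make the SBP abstraction concrete, the following is a minimal NumPy sketch (not OneFlow's API, just an illustration of the semantics) showing how the three SBP signatures describe the distribution of a matmul Y = X @ W across two hypothetical devices: split(0)/broadcast yields data parallelism, broadcast/split(1) yields column-sharded model parallelism, and split(1)/split(0) leaves each device with a partial-value of Y that must be summed.

```python
# Illustration of SBP (split, broadcast, partial-value) semantics with NumPy.
# "Devices" are simulated by holding each shard in a separate array.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activations
W = rng.standard_normal((8, 3))   # weights
Y = X @ W                         # single-device reference result

# Data parallelism: X is split(0) (row shards), W is broadcast (replicated).
# Each device computes a row shard of Y; Y is split(0), recovered by concatenation.
X0, X1 = np.split(X, 2, axis=0)
Y_data_parallel = np.concatenate([X0 @ W, X1 @ W], axis=0)
assert np.allclose(Y, Y_data_parallel)

# Model parallelism (column sharding): X is broadcast, W is split(1).
# Y is split(1), recovered by concatenating column shards.
W0, W1 = np.split(W, 2, axis=1)
Y_model_parallel = np.concatenate([X @ W0, X @ W1], axis=1)
assert np.allclose(Y, Y_model_parallel)

# Model parallelism (row sharding): X is split(1), W is split(0).
# Each device holds a partial-value of Y; the result is recovered by an
# element-wise sum of the partials (an all-reduce in a real system).
Xa, Xb = np.split(X, 2, axis=1)
Wa, Wb = np.split(W, 2, axis=0)
Y_partial_sum = Xa @ Wa + Xb @ Wb
assert np.allclose(Y, Y_partial_sum)
```

In OneFlow itself, these signatures are attached to global tensors as placement and sbp attributes (e.g., flow.sbp.split(0), flow.sbp.broadcast, flow.sbp.partial_sum in recent releases), and the framework derives the required communication between signatures automatically; see the repository linked above for the exact API.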