Scaling up deep neural networks has proven effective in improving model quality, but it also brings ever-growing training challenges. This paper presents Whale, an automatic and hardware-aware distributed training framework for giant models. Whale generalizes the expression of parallelism with four primitives, which can define various parallel strategies as well as flexible hybrid strategies, including combination and nesting patterns. It allows users to build models at an arbitrary scale by adding a few annotations, and it automatically transforms the local model into a distributed implementation. Moreover, Whale is hardware-aware and remains highly efficient even when training on GPUs of mixed types, meeting the growing demand for heterogeneous training in industrial clusters. Whale sets a milestone for training M6, the largest multimodal pretrained model. The success of M6 is achieved by Whale's design, which decouples algorithm modeling from system implementation: algorithm developers can focus on model innovation, since it takes only three lines of code to scale the M6 model to trillions of parameters on a cluster of 480 GPUs.
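To make the annotation idea concrete, the minimal sketch below illustrates the scope-based annotation style the abstract describes: a user writes an ordinary local model and wraps parts of it in parallelism annotations, leaving the distributed transformation to the framework. The `replica` and `split` context managers here are stand-ins defined inside the snippet purely for illustration; they are not Whale's actual API, whose module name, primitive names, and signatures differ.

```python
# Conceptual sketch only: these context managers are local stand-ins written
# for this example; they are NOT Whale's real API.
from contextlib import contextmanager

@contextmanager
def replica():
    """Stand-in for a 'replicate this scope across devices' annotation."""
    yield

@contextmanager
def split():
    """Stand-in for a 'split this scope across devices' annotation."""
    yield

def local_model(x):
    # The user writes ordinary local model code; annotations wrap parts of it.
    with replica():           # hypothetical data-parallel region
        hidden = x * 2        # placeholder for an encoder
    with split():             # hypothetical model-parallel region
        output = hidden + 1   # placeholder for a huge output layer
    return output

# Runs locally as plain Python; a framework such as Whale would instead use
# these annotations to rewrite the computation into a distributed one.
print(local_model(3))
```

Nesting one annotated scope inside another would correspond to the hybrid (combined and nested) strategies the abstract mentions, e.g. replicating a pipeline stage that is itself split across devices.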