Mixture-of-Experts (MoE) shows strong potential for enlarging language models to trillions of parameters. However, training a trillion-scale MoE model requires algorithm and system co-design to build a well-tuned, high-performance distributed training system. Unfortunately, the only existing platform that meets these requirements is tightly coupled to Google's hardware (TPU) and software (Mesh TensorFlow) stack, and is not open or available to the public, especially to the GPU and PyTorch communities. In this paper, we present FastMoE, a distributed MoE training system based on PyTorch with common accelerators. The system provides a hierarchical interface for both flexible model design and easy adaptation to different applications, such as Transformer-XL and Megatron-LM. Unlike a direct implementation of MoE models in PyTorch, FastMoE is highly optimized for training speed through sophisticated high-performance acceleration techniques. The system supports placing different experts on multiple GPUs across multiple nodes, allowing the number of experts to scale linearly with the number of GPUs. The source code of FastMoE is available at https://github.com/laekov/fastmoe under the Apache-2.0 license.
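For readers unfamiliar with the MoE layer that FastMoE accelerates, the following is a minimal, single-process sketch of a top-1 gated mixture-of-experts feed-forward block in plain PyTorch. It only illustrates the concept; it is not FastMoE's API or its optimized multi-GPU implementation, and the class and parameter names (NaiveMoEFFN, num_experts, d_model, d_hidden) are hypothetical.

```python
# Minimal sketch of a top-1 gated MoE feed-forward layer (conceptual only,
# not FastMoE's optimized, multi-GPU implementation).
import torch
import torch.nn as nn


class NaiveMoEFFN(nn.Module):
    def __init__(self, num_experts: int, d_model: int, d_hidden: int):
        super().__init__()
        # One independent feed-forward expert per slot.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )
        # The gate scores each token against every expert.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); route each token to its top-1 expert.
        expert_idx = self.gate(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out


if __name__ == "__main__":
    layer = NaiveMoEFFN(num_experts=4, d_model=16, d_hidden=32)
    tokens = torch.randn(8, 16)
    print(layer(tokens).shape)  # torch.Size([8, 16])
```

In this naive form, all experts live on one device and tokens are dispatched with a Python loop; FastMoE instead places experts on different GPUs across nodes and optimizes the dispatch, which is what enables the linear scaling of the number of experts described above.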