We present MegaBlocks, a system for efficient Mixture-of-Experts (MoE) training on GPUs. Our system is motivated by the limitations of current frameworks, which restrict the dynamic routing in MoE layers to satisfy the constraints of existing software and hardware. These formulations force a tradeoff between model quality and hardware efficiency, as users must choose between dropping tokens from the computation or wasting computation and memory on padding. To address these limitations, we reformulate MoE computation in terms of block-sparse operations and develop new block-sparse GPU kernels that efficiently handle the dynamism present in MoEs. Our approach never drops tokens and maps efficiently to modern hardware, enabling end-to-end training speedups of up to 40% over MoEs trained with the state-of-the-art Tutel library and 2.4x over DNNs trained with the highly-optimized Megatron-LM framework.
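To make the "dropless" formulation concrete, the following is a minimal conceptual sketch, not the MegaBlocks block-sparse kernels themselves: it routes every token to an expert and lets each expert process a variable-sized group of tokens, so nothing is dropped and nothing is padded. All names and sizes (num_experts, d_model, d_ffn, top-1 routing) are illustrative assumptions; MegaBlocks realizes this computation with custom block-sparse GPU kernels rather than a Python loop.

```python
import torch

# Illustrative sizes only.
num_tokens, d_model, d_ffn, num_experts = 512, 64, 256, 4

tokens = torch.randn(num_tokens, d_model)
router = torch.nn.Linear(d_model, num_experts)
experts = [torch.nn.Linear(d_model, d_ffn) for _ in range(num_experts)]

# Top-1 routing: every token is assigned to exactly one expert.
scores = router(tokens).softmax(dim=-1)
weights, assignment = scores.max(dim=-1)

out = tokens.new_zeros(num_tokens, d_ffn)
for e in range(num_experts):
    idx = (assignment == e).nonzero(as_tuple=True)[0]
    if idx.numel() == 0:
        continue
    # Each expert processes however many tokens it received: there is no
    # capacity limit, so no tokens are dropped and no padding is added.
    out[idx] = weights[idx, None] * experts[e](tokens[idx])
```

In this sketch the per-expert matrix multiplications have variable shapes determined at runtime by the router; expressing them as a single block-sparse operation is what allows the computation to map efficiently onto fixed-size GPU kernels.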