Sparsely activated Mixture-of-Experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to experts being under- or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function, regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts select the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources as the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2x. For the same computational cost, our method demonstrates higher performance when fine-tuning on 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.
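The routing idea described above (experts pick their top-k tokens, so each expert has a fixed bucket while a token may be selected by a variable number of experts) can be illustrated with a minimal NumPy sketch. The function name, the capacity_factor value, and the dispatch-mask layout are illustrative assumptions rather than the paper's reference implementation; the sketch assumes the per-expert bucket size k is derived from the number of tokens, a capacity factor, and the number of experts.

```python
import numpy as np

def expert_choice_routing(scores: np.ndarray, capacity_factor: float = 2.0):
    """Hypothetical sketch of expert-choice routing.

    scores: (n_tokens, n_experts) token-to-expert affinities
            (e.g. a softmax over router logits).
    Each expert selects its own top-k tokens, so every expert receives
    exactly k tokens (a fixed bucket) while a given token may be picked
    by zero, one, or several experts.
    """
    n_tokens, n_experts = scores.shape
    # Assumed fixed bucket size per expert: k = n_tokens * capacity_factor / n_experts.
    k = max(1, int(round(n_tokens * capacity_factor / n_experts)))

    # Experts pick tokens: operate on the transposed (n_experts, n_tokens) scores.
    expert_scores = scores.T
    topk_idx = np.argsort(-expert_scores, axis=1)[:, :k]            # (n_experts, k) token ids
    topk_gates = np.take_along_axis(expert_scores, topk_idx, axis=1)  # gating weights

    # Dispatch mask: dispatch[e, i, t] = 1 if expert e's i-th slot holds token t.
    dispatch = np.zeros((n_experts, k, n_tokens))
    for e in range(n_experts):
        dispatch[e, np.arange(k), topk_idx[e]] = 1.0
    return topk_idx, topk_gates, dispatch

# Usage: route 8 tokens among 4 experts (bucket size k = 4 here).
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
idx, gates, dispatch = expert_choice_routing(probs, capacity_factor=2.0)
print(idx.shape, gates.shape, dispatch.shape)  # (4, 4) (4, 4) (4, 4, 8)
```

Because every expert fills exactly k slots, load balance is enforced by construction; the trade-off, as the abstract notes, is that some tokens may be routed to several experts and others to none.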