Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large, and practitioners often resort to methods such as distillation for serving. In this work, we investigate routing strategies at different granularities (token, sentence, task) in MoE models to bypass distillation. Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models. On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best-performing token-level MoE model (token-MoE) by +1.0 BLEU on average across 30 language pairs. Peak inference throughput also improves by a factor of 1.9x when we route by tasks instead of tokens. While distilling a token-MoE into a smaller dense model preserves only 32% of the BLEU gains, our sub-network task-MoE, by design, preserves all the gains at the same inference cost as the distilled student model. Finally, when scaling up to 200 language pairs, our 128-expert task-MoE (13B parameters) performs competitively with its token-level counterpart while improving peak inference throughput by a factor of 2.6x.