We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. However, it can be difficult to learn balanced routing functions that make full use of the available experts; existing approaches typically use routing heuristics or auxiliary expert-balancing loss functions. In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens. This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyperparameters or auxiliary losses. Code is publicly released at https://github.com/pytorch/fairseq/
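To make the core idea concrete, here is a minimal sketch of balanced token-to-expert assignment posed as a linear assignment problem. It is not the fairseq implementation: the function name `balanced_assignment`, the `token_expert_scores` input, and the use of SciPy's Hungarian solver are all illustrative assumptions; it only assumes the token count divides evenly by the expert count.

```python
import torch
from scipy.optimize import linear_sum_assignment

def balanced_assignment(token_expert_scores: torch.Tensor) -> torch.Tensor:
    """Assign each token to exactly one expert so that every expert
    receives the same number of tokens, maximizing total affinity.

    token_expert_scores: (num_tokens, num_experts) affinity matrix.
    Returns: (num_tokens,) tensor of expert indices, one per token.
    """
    num_tokens, num_experts = token_expert_scores.shape
    assert num_tokens % num_experts == 0, "tokens must split evenly across experts"
    capacity = num_tokens // num_experts

    # Duplicate each expert column `capacity` times so the problem becomes
    # a square assignment: one token per expert "slot".
    expanded = token_expert_scores.repeat_interleave(capacity, dim=1)

    # Hungarian algorithm maximizes total affinity under the
    # one-token-per-slot constraint. This is O(n^3), so a faster
    # approximate solver would be needed at training scale.
    row_idx, col_idx = linear_sum_assignment(
        expanded.detach().cpu().numpy(), maximize=True
    )

    # Map each slot back to its expert index.
    assignment = torch.empty(num_tokens, dtype=torch.long)
    assignment[torch.as_tensor(row_idx)] = torch.as_tensor(col_idx // capacity)
    return assignment
```

Because every expert is given exactly `capacity` slots, the solver is structurally prevented from collapsing onto a few popular experts, which is why no auxiliary balancing loss or tuning hyperparameter is needed.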