Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy-efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts, or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning across a diverse set of tasks, including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed-book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).