Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.