PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie Zhang, Mingyue Guo, Shanzhi Gu, Gaojun Fan, Yaowei Wang, Xuefeng Jin, Qun Liu, Yonghong Tian

From arXiv: the technical report for PanGu-$\alpha$.

Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions of parameters, such as GPT-3, have demonstrated strong performance on natural language understanding and generation with few-shot in-context learning. In this work, we present our practice in training large-scale autoregressive language models named PanGu-$\alpha$, with up to 200 billion parameters. PanGu-$\alpha$ is developed under MindSpore and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented with MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently: data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance the generalization ability of PanGu-$\alpha$, we collect 1.1TB of high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-$\alpha$ in various scenarios, including text summarization, question answering, and dialogue generation. Moreover, we investigate the effect of model scale on few-shot performance across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-$\alpha$ in performing various tasks under few-shot or zero-shot settings.
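
To make the composition of parallelism dimensions concrete, the following is a minimal sketch, not the MindSpore Auto-parallel implementation. It assumes that data parallelism, op-level model parallelism, and pipeline model parallelism each partition the device pool, so their degrees multiply to the device count, while optimizer model parallelism and rematerialization reduce memory use without consuming additional devices. The class `ParallelLayout`, the function `layouts_for`, and the op-level degree of 8 are hypothetical choices made for illustration; they are not the configuration reported for PanGu-$\alpha$.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical split of a device pool across three parallelism degrees."""
    data_parallel: int    # number of model replicas, each fed a different batch shard
    op_parallel: int      # devices that share the tensors of a single operator
    pipeline_stages: int  # consecutive layer groups placed on different devices

    def devices(self) -> int:
        # Assumption: the three degrees partition the devices, so they multiply.
        return self.data_parallel * self.op_parallel * self.pipeline_stages


def layouts_for(total_devices: int, op_parallel: int) -> List[ParallelLayout]:
    """List every layout that uses exactly `total_devices` for a fixed op-level degree."""
    if total_devices % op_parallel != 0:
        return []
    remaining = total_devices // op_parallel
    return [
        ParallelLayout(remaining // stages, op_parallel, stages)
        for stages in range(1, remaining + 1)
        if remaining % stages == 0
    ]


if __name__ == "__main__":
    # Illustrative only: 2048 devices with an assumed op-level degree of 8.
    for layout in layouts_for(2048, op_parallel=8):
        print(layout, "->", layout.devices(), "devices")
```

Enumerating the layouts this way shows the trade-off a parallelism strategy has to resolve: for a fixed device count, more pipeline stages leave fewer data-parallel replicas, and vice versa.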
