Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3, with 175 billion parameters and 570GB of training data, drew significant attention due to its capacity for few-shot (even zero-shot) learning. However, applying GPT-3 to Chinese NLP tasks remains challenging, as its training corpus is primarily English and its parameters are not publicly available. In this technical report, we release the Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best of our knowledge, CPM, with 2.6 billion parameters and 100GB of Chinese training data, is the largest Chinese pre-trained language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation, cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many NLP tasks in few-shot (even zero-shot) settings. The code and parameters are available at https://github.com/TsinghuaAI/CPM-Generate.