Recent work in language modeling has shown that training large-scale Transformer models advances the state of the art in natural language processing applications. However, little work has been done to unify the currently effective models. In this work, we use currently effective model architectures, together with the most mainstream training techniques, to release a set of models that we believe will serve as foundation models in the future. For Chinese, using the GPT-2[9] architecture, we trained a 10.3-billion-parameter language model on a Chinese dataset and, in particular, a 2.9-billion-parameter language model on dialogue data; we trained a BERT model with 495 million parameters on a Chinese dataset; and we trained a 5.6-billion-parameter Transformer language model on a Chinese dataset. For English, we carried out the corresponding training: using the GPT-2 architecture, we trained a 6.4-billion-parameter language model on an English dataset; with the BERT[3] architecture, we trained a 1.24-billion-parameter language model on an English dataset and, in particular, a 688-million-parameter language model using single-card training technology; and we trained a 5.6-billion-parameter Transformer language model on an English dataset. On the TNEWS classification task of the CLUE[13] benchmark, our BERT-C model achieved 59.99% accuracy, exceeding the 59.46% of ALBERT-xxlarge by 0.53%. On the QQP classification task of the GLUE[11] benchmark, it reached 78.95% accuracy, surpassing BERT-Large's 72.1% by 6.85%, and exceeding the 75.2% accuracy of ERNIE, currently first on the GLUE leaderboard, by 3.75%.
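As a rough illustration of how parameter counts like those quoted above arise in a GPT-2-style decoder, the sketch below estimates the total from depth and width. The layer count, hidden size, and vocabulary size used here are illustrative assumptions, not values reported in this paper.

```python
def transformer_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Approximate parameter count for a GPT-2-style decoder stack."""
    # Each block: attention projections (4 * d^2) + MLP (8 * d^2) = 12 * d^2,
    # ignoring biases and layer norms, which are small by comparison.
    per_block = 12 * d_model * d_model
    # Token embedding matrix (often tied with the output projection).
    embedding = vocab_size * d_model
    return n_layers * per_block + embedding

if __name__ == "__main__":
    # Hypothetical configuration: 48 layers, hidden size 4096, 50k vocabulary.
    # Yields roughly 9.9 billion parameters, i.e. on the order of the
    # 10.3-billion-parameter Chinese GPT-2 model described in the abstract.
    print(f"{transformer_params(48, 4096, 50_000):,}")
```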