MiLMo: Minority Multilingual Pre-trained Language Model

Pre-trained language models are trained on large-scale unlabeled data and can then be fine-tuned on small-scale labeled datasets to achieve good results. Multilingual pre-trained language models are trained on multiple languages, so a single model can understand several languages at once. At present, research on pre-trained models focuses mainly on high-resource languages; there is relatively little research on low-resource languages such as minority languages, and publicly available multilingual pre-trained language models do not work well for them. This paper therefore constructs MiLMo, a multilingual pre-trained model that performs better on minority language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. To address the scarcity of minority language datasets and to verify the effectiveness of MiLMo, the paper constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec models with the pre-trained model on the text classification task, the paper provides an optimal scheme for downstream task research on minority languages. The final experimental results show that the pre-trained model outperforms the word2vec models and achieves the best results in minority multilingual text classification. The multilingual pre-trained model MiLMo, the multilingual word2vec models and the multilingual text classification dataset MiTC are published at http://milmo.cmli-nlp.com/.
