Pre-trained language models are trained on large-scale unsupervised data and can be fine-tuned on small-scale labeled datasets to achieve good results. Multilingual pre-trained language models are trained on multiple languages and can understand multiple languages at the same time. At present, research on pre-trained models mainly focuses on rich-resource languages, while there is relatively little research on low-resource languages such as minority languages, and publicly available multilingual pre-trained language models do not work well for minority languages. Therefore, this paper constructs a multilingual pre-trained language model named MiLMo that performs better on minority language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh, and Korean. To address the scarcity of minority-language datasets and to verify the effectiveness of the MiLMo model, this paper also constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec models with the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream-task research on minority languages. The experimental results show that the pre-trained model outperforms the word2vec models and achieves the best results on minority multilingual text classification. The multilingual pre-trained language model MiLMo, the multilingual word2vec models, and the multilingual text classification dataset MiTC are published at https://milmo.cmli-nlp.com.