Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. Studies have shown that monolingual models produce better results than multilingual ones, provided the training datasets are sufficiently large. We trained a trilingual BERT-like model, LitLat BERT, for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and of a large training set, we compare the created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve on existing models across all tested tasks in most situations.