This research provides the first comprehensive analysis of the performance of pre-trained language models for Sinhala text classification. We evaluate the models on a range of Sinhala text classification tasks, and our analysis shows that among the pre-trained multilingual models that include Sinhala (XLM-R, LaBSE, and LASER), XLM-R is by far the best for Sinhala text classification. We also pre-train two RoBERTa-based monolingual Sinhala models, which substantially outperform the previously available pre-trained models for Sinhala. We show that, when fine-tuned, these pre-trained language models set a very strong baseline for Sinhala text classification and remain robust when the labeled data available for fine-tuning is scarce. We further provide a set of recommendations for using pre-trained models for Sinhala text classification. Finally, we introduce new annotated datasets useful for future research in Sinhala text classification and publicly release our pre-trained models.
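As an illustration of the fine-tuning setup referred to above, the sketch below fine-tunes an XLM-R checkpoint for a toy Sinhala classification task with Hugging Face Transformers. It is a minimal sketch, not the paper's actual configuration: the checkpoint name, label set, example sentences, and hyperparameters are all illustrative assumptions.

```python
# Minimal sketch: fine-tuning XLM-R for Sinhala text classification.
# All data, labels, and hyperparameters here are illustrative placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "xlm-roberta-base"  # assumption: any Sinhala-capable checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Tiny in-memory dataset standing in for a real labeled Sinhala corpus.
data = Dataset.from_dict({
    "text": ["ශ්‍රී ලංකා ක්‍රිකට් කණ්ඩායම ජය ගත්තා",   # sports-like sentence
             "රුපියල තවදුරටත් අවප්‍රමාණය වේ"],          # economy-like sentence
    "label": [0, 1],  # hypothetical labels: 0 = sports, 1 = economy
})

def tokenize(batch):
    # Fixed-length padding so the default collator can batch examples directly.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-sinhala-clf",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,  # a real run would also pass an eval_dataset
)
trainer.train()
```

In practice, the same skeleton applies to a monolingual Sinhala checkpoint by swapping the model name; the classification head and training loop are unchanged.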