In this paper, we improve on existing language resources for the low-resource Filipino language in two ways. First, we outline the construction of the TLUnified dataset, a large-scale pretraining corpus that improves upon smaller existing pretraining datasets for the language in both scale and topic variety. Second, we pretrain new Transformer language models following the RoBERTa pretraining technique to supplant existing models trained on small corpora. Our new RoBERTa models show significant improvements over existing Filipino models on three benchmark datasets, with an average gain of 4.47% test accuracy across three classification tasks of varying difficulty.