Transformers have represented the state of the art in Natural Language Processing (NLP) in recent years, proving effective even in tasks performed in low-resource languages. While pretrained transformers for these languages can be produced, measuring their true performance and capacity is challenging due to the lack of hard benchmark datasets, as well as the difficulty and cost of producing them. In this paper, we present three contributions: First, we propose a methodology for automatically producing Natural Language Inference (NLI) benchmark datasets for low-resource languages using published news articles. Through this, we create and release NewsPH-NLI, the first sentence-entailment benchmark dataset in the low-resource Filipino language. Second, we produce new pretrained transformers based on the ELECTRA technique to further alleviate the resource scarcity in Filipino, benchmarking them on our dataset against other commonly used transfer learning techniques. Lastly, we perform analyses on transfer learning techniques, using degradation tests to shed light on their true performance when operating in low-data domains.