Multilingual sentence representations offer a major advantage for low-resource languages that lack enough data to build monolingual models of their own. Only a few studies have exploited these representations, and they have treated document alignment and sentence alignment separately. However, most low-resource languages are under-represented in these pre-trained models. Thus, for low-resource languages, these models have to be fine-tuned for the task at hand using additional data sources. This paper presents a weighting mechanism that uses available small-scale parallel corpora to improve the performance of multilingual sentence representations on document and sentence alignment. Experiments are conducted on two low-resource languages, Sinhala and Tamil. Results on a newly created dataset of Sinhala-English, Tamil-English, and Sinhala-Tamil show that this new weighting mechanism significantly improves both document and sentence alignment. The dataset and the source code are publicly released.