Building Machine Translation (MT) systems for low-resource languages remains challenging. For many language pairs, parallel data are not widely available, and in such cases MT models do not achieve results comparable to those obtained for high-resource languages. When data are scarce, making optimal use of the limited material available is of paramount importance. To that end, in this paper we propose employing the same parallel sentences multiple times, changing only the way the words are split each time. For this purpose we use several Byte Pair Encoding models, each configured with a different number of merge operations. In our experiments, we use this technique to expand the available data and improve an MT system for a low-resource language pair, namely English-Esperanto. As an additional contribution, we make available a set of English-Esperanto parallel data in the literary domain.
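The core idea, producing several different segmentations of the same text by varying the number of BPE merge operations, can be illustrated with a toy BPE implementation. This is a minimal sketch with assumed names and a made-up corpus, not the paper's actual code or data:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merge operations from a word-frequency dict."""
    vocab = {tuple(w) + ('</w>',): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Split one word into subwords by replaying the learned merges in order."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical toy corpus: word -> frequency.
corpus = {'lower': 5, 'low': 4, 'newer': 6, 'newest': 3}
for n in (5, 15):
    merges = learn_bpe(corpus, n)
    print(n, segment('lowest', merges))
```

Training several such models with different `num_merges` values and segmenting the same parallel sentence with each of them yields several distinct subword views of that sentence, which is the augmentation mechanism the abstract describes.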