This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting. We compare segmentations produced by applying BPE at the token or sentence level with morphologically based segmentations from LMVR and MORSEL. We evaluate translation tasks between English and each of Nepali, Sinhala, and Kazakh, and hypothesize that morphologically based segmentation methods will lead to better performance in this setting. However, we find no consistent and reliable differences between the morphologically based methods and BPE. While morphologically based methods outperform BPE in a few cases, the best-performing method varies across tasks, and the differences in performance between segmentation methods are often not statistically significant.