Data sparsity is one of the main challenges posed by code-switching (CS), and it is further exacerbated in morphologically rich languages. In machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated in CS settings. In this paper, we study the effect of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide a detailed analysis under a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform best in segmentation tasks but underperform in MT. Nevertheless, we find that the choice of segmentation setup for MT depends heavily on data size. In extreme low-resource scenarios, a combination of frequency-based and morphology-based segmentation performs best. In more resourced settings, such a combination does not bring significant improvements over using frequency-based segmentation.
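To make the notion of frequency-based segmentation concrete, the following is a minimal sketch of byte-pair encoding (BPE), one common frequency-based technique. The abstract does not name the specific segmenters used, so BPE is an assumption chosen for illustration, and the function names `learn_bpe`, `apply_merge`, and `segment` as well as the toy word list are hypothetical. In a CS setting, the training words would come from the mixed Arabic-English corpus rather than from a single language.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of characters, counted by corpus frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Toy, hypothetical training data.
toy_words = ["low", "low", "lower", "lowest"]
toy_merges = learn_bpe(toy_words, num_merges=2)
print(segment("lowly", toy_merges))
```

Unlike a morphology-aware segmenter, this procedure has no linguistic knowledge: the units it produces depend only on pair frequencies in the training data, which is why its behaviour can change with data size.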