Data sparsity is one of the main challenges posed by code-switching (CS), and it is further exacerbated in morphologically rich languages. For the task of machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effect of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide a detailed analysis, examining a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform best on segmentation tasks but underperform in MT. Nevertheless, we find that the choice of segmentation setup for MT depends heavily on data size. In extreme low-resource scenarios, a combination of frequency-based and morphology-based segmentation performs best. In better-resourced settings, such a combination does not bring significant improvements over frequency-based segmentation alone.