Data sparsity is a major problem hindering the development of code-switching (CS) NLP systems. In this paper, we investigate data augmentation techniques for synthesizing dialectal Arabic-English CS text. We perform lexical replacements using word-aligned parallel corpora, where CS points are either chosen randomly or learned by a sequence-to-sequence model, and we compare these approaches against dictionary-based replacements. We assess the quality of the generated sentences through human evaluation, and we measure the effectiveness of data augmentation on machine translation (MT), automatic speech recognition (ASR), and speech translation (ST) tasks. Human judgements show that the predictive model yields more natural CS sentences than the random approach. On the downstream tasks, despite the random approach generating more data, both approaches perform equally well, and both outperform dictionary-based replacements. Overall, data augmentation achieves a 34% improvement in perplexity, a 5.2% relative WER reduction on ASR, +4.0-5.1 BLEU points on MT, and +2.1-2.2 BLEU points on ST over a baseline trained on the available data without augmentation.
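The random-replacement variant described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the alignment format (a list of source-to-target index pairs, as produced by tools such as word aligners over parallel corpora), and the per-word switching probability `p` are all illustrative assumptions. Transliterated placeholder tokens stand in for dialectal Arabic words.

```python
import random

def random_lexical_replacement(src_tokens, tgt_tokens, alignments, p=0.3, seed=None):
    """Synthesize a code-switched sentence by replacing aligned source words
    with their target-side translations at randomly chosen CS points.

    alignments: list of (src_index, tgt_index) pairs from a word-aligned
    parallel corpus; p: probability of switching at each aligned word.
    (Illustrative sketch, not the paper's implementation.)
    """
    rng = random.Random(seed)
    # Map each source index to one aligned target index (first alignment wins).
    src2tgt = {}
    for s, t in alignments:
        src2tgt.setdefault(s, t)
    out = []
    for i, tok in enumerate(src_tokens):
        if i in src2tgt and rng.random() < p:
            out.append(tgt_tokens[src2tgt[i]])  # switch to the English word
        else:
            out.append(tok)  # keep the original (Arabic) word
    return out

# Placeholder example (transliterated Arabic source, English target):
src = ["ana", "bahibb", "el", "qahwa"]
tgt = ["i", "love", "the", "coffee"]
align = [(0, 0), (1, 1), (3, 3)]
# With p=1.0 every aligned word is switched: ["i", "love", "el", "coffee"]
print(random_lexical_replacement(src, tgt, align, p=1.0))
```

The predictive variant differs only in how switch points are chosen: instead of sampling each aligned word with probability `p`, a sequence-to-sequence model trained on real CS text decides which words to switch, which is what produces the more natural sentences reported in the human evaluation.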