Code-switching (CS) poses several challenges to NLP tasks, and data sparsity is a major obstacle to developing CS NLP systems. In this paper, we investigate data augmentation techniques for synthesizing Dialectal Arabic-English CS text. We perform lexical replacements using parallel corpora and alignments, where CS points are either chosen randomly or learnt using a sequence-to-sequence model. We evaluate the effectiveness of data augmentation on language modeling (LM), machine translation (MT), and automatic speech recognition (ASR) tasks. Results show that, with 1-1 alignments, trained predictive models produce more natural CS sentences, as reflected in perplexity. Relying on grow-diag-final alignments, we then identify aligned segments and perform replacements accordingly. Replacing segments instead of words greatly improves the quality of the synthesized data. With this improvement, the random-based approach outperforms trained predictive models on all extrinsic tasks. Our best models achieve a 33.6% improvement in perplexity, +3.2-5.6 BLEU points on the MT task, and a 7% relative improvement in WER on the ASR task. We also help fill the gap in resources by collecting and publishing the first Arabic-English CS-English parallel corpus.
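The random variant of word-level lexical replacement can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `synthesize_cs`, the `switch_prob` parameter, and the representation of 1-1 alignments as `(source_index, target_index)` pairs are all assumptions made for illustration.

```python
import random


def synthesize_cs(src_tokens, tgt_tokens, alignment, switch_prob=0.3, rng=None):
    """Return a synthetic code-switched sentence (illustrative sketch only).

    src_tokens:  tokens of the matrix-language sentence.
    tgt_tokens:  tokens of its parallel sentence in the other language.
    alignment:   list of (src_idx, tgt_idx) pairs, assumed to be 1-1.
    switch_prob: hypothetical probability of switching at an aligned word.
    """
    rng = rng or random.Random()
    src_to_tgt = dict(alignment)  # valid because the alignment is assumed 1-1
    output = []
    for i, token in enumerate(src_tokens):
        # Only aligned words are candidates; each is switched independently.
        if i in src_to_tgt and rng.random() < switch_prob:
            output.append(tgt_tokens[src_to_tgt[i]])
        else:
            output.append(token)
    return output
```

Setting `switch_prob` to 0 leaves the source sentence unchanged, and setting it to 1 replaces every aligned word. The segment-level variant would replace aligned spans rather than single tokens, and the learnt variant would replace the random draw with the predictions of a trained model.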