Generating code-switched text is a problem of growing interest, especially given the scarcity of corpora containing large volumes of real code-switched (CS) text. In this work, we adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences starting from monolingual Hindi sentences. We outline a carefully designed curriculum of pretraining steps, including the use of synthetic code-switched text, that enables the model to generate high-quality code-switched text. Using text generated by our model for data augmentation, we show significant reductions in perplexity on a language modeling task compared to using text from other generative models of CS text. We also show improvements from using our text on a downstream code-switched natural language inference task. Our generated text is further subjected to a rigorous evaluation via a human evaluation study and a range of objective metrics, where we show quality comparable (and sometimes even superior) to that of code-switched text obtained from crowd workers who are native Hindi speakers.