We describe models focused on the understudied problem of translating between monolingual and code-mixed language pairs. More specifically, we offer a wide range of models that convert monolingual English text into Hinglish (code-mixed Hindi and English). Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models (i.e., mT5 and mBART) on the task, finding both to work well. Given the paucity of training data for code-mixing, we also propose a dependency-free method for generating code-mixed texts from bilingual distributed representations that we exploit to improve language model performance. In particular, armed with this additional data, we adopt a curriculum learning approach in which we first finetune the language models on synthetic data and then on gold code-mixed data. We find that, although simple, our synthetic code-mixing method is competitive with (and in some cases even superior to) several standard methods (backtranslation, and a method based on equivalence constraint theory) under a diverse set of conditions. Our work shows that the mT5 model, finetuned following the curriculum learning procedure, achieves the best translation performance (12.67 BLEU). Our models place first in the overall ranking of the official English-Hinglish shared task.