Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing

We describe models aimed at the understudied problem of translating between monolingual and code-mixed language pairs. More specifically, we offer a wide range of models that convert monolingual English text into Hinglish (code-mixed Hindi and English). Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models (i.e., mT5 and mBART) on the task, finding that both work well. Given the paucity of training data for code-mixing, we also propose a dependency-free method for generating code-mixed texts from bilingual distributed representations, and we exploit the generated texts to improve language model performance. In particular, armed with this additional data, we adopt a curriculum learning approach in which we first finetune the language models on synthetic data and then on gold code-mixed data. We find that, although simple, our synthetic code-mixing method is competitive with (and in some cases even superior to) several standard methods (backtranslation and a method based on equivalence constraint theory) under a diverse set of conditions. Our work shows that the mT5 model, finetuned following the curriculum learning procedure, achieves the best translation performance (12.67 BLEU). Our models place first in the overall ranking of the English-Hinglish official shared task.
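The abstract does not spell out how code-mixed text is derived from bilingual distributed representations, so the following is only a minimal sketch under stated assumptions: English tokens are swapped for their nearest neighbour in a shared English-Hindi embedding space, and the curriculum is modelled as an ordered list of finetuning stages (synthetic first, gold second). The toy vectors, the romanized Hindi vocabulary, and the names `nearest_hindi`, `code_mix`, `switch_prob`, and `curriculum_stages` are all hypothetical and are not taken from the paper.

```python
import math
import random

# Toy shared (bilingual) embedding space. Every vector here is invented
# for illustration; real systems would load pretrained aligned embeddings.
EN_VECS = {
    "i": (0.9, 0.1, 0.0),
    "like": (0.1, 0.9, 0.1),
    "tea": (0.0, 0.2, 0.9),
}
HI_VECS = {
    "main": (0.88, 0.12, 0.02),
    "pasand": (0.12, 0.85, 0.15),
    "chai": (0.02, 0.18, 0.92),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_hindi(word):
    """Return the Hindi word closest to `word` in the shared space, or None."""
    vec = EN_VECS.get(word.lower())
    if vec is None:
        return None  # out-of-vocabulary tokens are left untouched
    return max(HI_VECS, key=lambda h: cosine(vec, HI_VECS[h]))


def code_mix(sentence, switch_prob=0.5, rng=None):
    """Replace each known English token by its Hindi neighbour with
    probability `switch_prob` (an assumed knob, not a paper parameter)."""
    if rng is None:
        rng = random.Random(0)
    mixed = []
    for token in sentence.split():
        candidate = nearest_hindi(token)
        if candidate is not None and rng.random() < switch_prob:
            mixed.append(candidate)
        else:
            mixed.append(token)
    return " ".join(mixed)


def curriculum_stages(synthetic_pairs, gold_pairs):
    """Order the finetuning data as described: synthetic first, then gold."""
    return [("synthetic", list(synthetic_pairs)), ("gold", list(gold_pairs))]
```

With this sketch, each English source sentence can be paired with `code_mix(sentence)` to form a synthetic training pair, and a training loop would iterate over `curriculum_stages(...)` in order, finetuning the same model on each stage in turn. Word-level substitution ignores word order and morphology, so the output is a rough approximation of natural code-mixing at best.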

