From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text

From arXiv. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021).

Generating code-switched text is a problem of growing interest, especially given the scarcity of corpora containing large volumes of real code-switched text. In this work, we adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences starting from monolingual Hindi sentences. We outline a carefully designed curriculum of pretraining steps, including the use of synthetic code-switched text, that enable the model to generate high-quality code-switched text. Using text generated from our model as data augmentation, we show significant reductions in perplexity on a language modeling task, compared to using text from other generative models of code-switched text. We also show improvements using our text for a downstream code-switched natural language inference task. Our generated text is further subjected to a rigorous evaluation using a human evaluation study and a range of objective metrics, where we show performance comparable (and sometimes even superior) to code-switched text obtained via crowd workers who are native Hindi speakers.
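The perplexity comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of perplexity as the exponential of the average negative log-likelihood per token; the function name `perplexity` and the per-token probabilities are hypothetical and are not taken from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities assigned to each token.

    Standard definition: exp of the mean negative log-likelihood.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities that two language models assign to
# the same held-out code-switched sentence (illustrative values only).
baseline = [math.log(p) for p in (0.10, 0.05, 0.20, 0.10)]
augmented = [math.log(p) for p in (0.20, 0.10, 0.25, 0.20)]

print(f"baseline perplexity:  {perplexity(baseline):.2f}")
print(f"augmented perplexity: {perplexity(augmented):.2f}")
```

A lower value means the language model finds the held-out text less surprising, which is the sense in which augmenting training data with generated text can reduce perplexity.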

