Pre-training is an effective technique for ensuring robust performance on a variety of machine learning tasks. It typically depends on large-scale crawled corpora that can result in toxic or biased models. Such data can also be problematic with respect to copyright, attribution, and privacy. Pre-training with synthetic tasks and data is a promising way of alleviating such concerns since no real-world information is ingested by the model. Our goal in this paper is to understand what makes for a good pre-trained model when using synthetic resources. We answer this question in the context of neural machine translation by considering two novel approaches to translation model pre-training. Our first approach studies the effect of pre-training on obfuscated data derived from a parallel corpus by mapping words to a vocabulary of 'nonsense' tokens. Our second approach explores the effect of pre-training on procedurally generated synthetic parallel data that does not depend on any real human language corpus. Our empirical evaluation on multiple language pairs shows that, to a surprising degree, the benefits of pre-training can be realized even with obfuscated or purely synthetic parallel data. In our analysis, we consider the extent to which obfuscated and synthetic pre-training techniques can be used to mitigate the issue of hallucinated model toxicity.
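To make the two pre-training regimes described above concrete, the following is a minimal sketch, not the authors' released code: the function names `obfuscate_corpus` and `generate_synthetic_pair`, the token format, and the reversal transformation are all illustrative assumptions, showing (1) mapping real words to a consistent vocabulary of 'nonsense' tokens while preserving parallel structure, and (2) procedurally generating a parallel pair with no human-language source.

```python
import random

def obfuscate_corpus(sentence_pairs, vocab_size=32000, seed=0):
    """Illustrative obfuscation: replace each distinct word with a consistent
    'nonsense' token, preserving alignment, word order, and repetition while
    removing real-world lexical content. (Hypothetical helper, not the paper's
    exact procedure.)"""
    rng = random.Random(seed)
    mapping = {}

    def obfuscate(tokens):
        out = []
        for tok in tokens:
            if tok not in mapping:
                mapping[tok] = f"tok{rng.randrange(vocab_size)}"
            out.append(mapping[tok])
        return out

    return [(obfuscate(src.split()), obfuscate(tgt.split()))
            for src, tgt in sentence_pairs]


def generate_synthetic_pair(length=10, vocab_size=32000, seed=None):
    """Illustrative procedural generation: the 'target' is a deterministic
    transformation (here, reversal) of a random 'source' sequence, so the
    model must still learn a source-conditioned mapping during pre-training.
    The choice of transformation is an assumption for illustration."""
    rng = random.Random(seed)
    src = [f"tok{rng.randrange(vocab_size)}" for _ in range(length)]
    tgt = list(reversed(src))
    return src, tgt


if __name__ == "__main__":
    pairs = [("the cat sat", "le chat était assis")]
    print(obfuscate_corpus(pairs))
    print(generate_synthetic_pair(length=5, seed=1))
```

In both cases the pre-training corpus contains no real-world lexical content, which is what allows the concerns about toxicity, bias, copyright, attribution, and privacy raised above to be sidestepped.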