Pre-training is an effective technique for ensuring robust performance on a variety of machine learning tasks. It typically depends on large-scale crawled corpora that can result in toxic or biased models. Such data can also be problematic with respect to copyright, attribution, and privacy. Pre-training with synthetic tasks and data is a promising way of alleviating such concerns since no real-world information is ingested by the model. Our goal in this paper is to understand what makes for a good pre-trained model when using synthetic resources. We answer this question in the context of neural machine translation by considering two novel approaches to translation model pre-training. Our first approach studies the effect of pre-training on obfuscated data derived from a parallel corpus by mapping words to a vocabulary of 'nonsense' tokens. Our second approach explores the effect of pre-training on procedurally generated synthetic parallel data that does not depend on any real human language corpus. Our empirical evaluation on multiple language pairs shows that, to a surprising degree, the benefits of pre-training can be realized even with obfuscated or purely synthetic parallel data. In our analysis, we consider the extent to which obfuscated and synthetic pre-training techniques can be used to mitigate the issue of hallucinated model toxicity.
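To make the two pre-training regimes described above concrete, the following is a minimal sketch, not the authors' released code: the function names `obfuscate_corpus` and `generate_synthetic_pair`, the token format, and the reversal transformation are all illustrative assumptions, showing (1) mapping real words to a consistent vocabulary of 'nonsense' tokens while preserving parallel structure, and (2) procedurally generating a parallel pair with no human-language source.

```python
import random

def obfuscate_corpus(sentence_pairs, vocab_size=32000, seed=0):
    """Illustrative obfuscation: replace each distinct word with a consistent
    'nonsense' token, preserving alignment, word order, and repetition while
    removing real-world lexical content. (Hypothetical helper, not the paper's
    exact procedure.)"""
    rng = random.Random(seed)
    mapping = {}

    def obfuscate(tokens):
        out = []
        for tok in tokens:
            if tok not in mapping:
                mapping[tok] = f"tok{rng.randrange(vocab_size)}"
            out.append(mapping[tok])
        return out

    return [(obfuscate(src.split()), obfuscate(tgt.split()))
            for src, tgt in sentence_pairs]


def generate_synthetic_pair(length=10, vocab_size=32000, seed=None):
    """Illustrative procedural generation: the 'target' is a deterministic
    transformation (here, reversal) of a random 'source' sequence, so the
    model must still learn a source-conditioned mapping during pre-training.
    The choice of transformation is an assumption for illustration."""
    rng = random.Random(seed)
    src = [f"tok{rng.randrange(vocab_size)}" for _ in range(length)]
    tgt = list(reversed(src))
    return src, tgt


if __name__ == "__main__":
    pairs = [("the cat sat", "le chat était assis")]
    print(obfuscate_corpus(pairs))
    print(generate_synthetic_pair(length=5, seed=1))
```

In both cases the pre-training corpus contains no real-world lexical content, which is what allows the concerns about toxicity, bias, copyright, attribution, and privacy raised above to be sidestepped.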