We present a simple and effective pretraining strategy, Denoising Training (DoT), for neural machine translation. Specifically, we update the model parameters with source- and target-side denoising tasks in the early stage of training and then tune the model normally. Notably, our approach adds no extra parameters or training steps and requires only the parallel data. Experiments show that DoT consistently improves neural machine translation performance across 12 bilingual and 16 multilingual directions (with data sizes ranging from 80K to 20M). In addition, we show that DoT complements existing data manipulation strategies, i.e., curriculum learning, knowledge distillation, data diversification, bidirectional training, and back-translation. Encouragingly, we find that DoT outperforms the costly pretrained model mBART in high-resource settings. Analyses show that DoT is a novel in-domain cross-lingual pretraining strategy and could offer further improvements with task-relevant self-supervisions.
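To make the two-stage schedule concrete, here is a minimal sketch of denoising training followed by normal translation training on the same parallel data. The toy Transformer, the BART-style token-masking noise function, and the step-budget split are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the DoT schedule: early denoising updates, then normal MT training.
# All hyperparameters, the noise function, and the toy model are assumptions.
import random
import torch
import torch.nn as nn

PAD, BOS, EOS, MASK = 0, 1, 2, 3
VOCAB = 1000

class TinySeq2Seq(nn.Module):
    """Toy encoder-decoder used only to make the schedule concrete."""
    def __init__(self, vocab=VOCAB, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model, padding_idx=PAD)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, src, tgt_in, tgt_mask=None):
        h = self.transformer(self.embed(src), self.embed(tgt_in),
                             tgt_mask=tgt_mask)
        return self.proj(h)

def add_noise(tokens, mask_prob=0.15):
    """BART-style token masking (an assumed noise function)."""
    return [MASK if random.random() < mask_prob else t for t in tokens]

def update(model, optim, loss_fn, src, tgt):
    """One teacher-forced update: predict tgt from src."""
    src_t = torch.tensor([src])
    tgt_in = torch.tensor([[BOS] + tgt])
    tgt_out = torch.tensor([tgt + [EOS]])
    causal = model.transformer.generate_square_subsequent_mask(tgt_in.size(1))
    logits = model(src_t, tgt_in, tgt_mask=causal)
    loss = loss_fn(logits.view(-1, VOCAB), tgt_out.view(-1))
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()

model = TinySeq2Seq()
optim = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

parallel_data = [([10, 11, 12], [20, 21, 22, 23])]  # (source, target) token ids
denoise_steps, total_steps = 100, 1000              # illustrative budget split

for step_id in range(total_steps):
    src, tgt = random.choice(parallel_data)
    if step_id < denoise_steps:
        # Early stage: source- and target-side denoising on the same
        # parallel corpus (reconstruct clean text from its noised version).
        update(model, optim, loss_fn, add_noise(src), src)
        update(model, optim, loss_fn, add_noise(tgt), tgt)
    else:
        # Later stage: normal source-to-target translation training.
        update(model, optim, loss_fn, src, tgt)
```

The key point the sketch illustrates is that the denoising stage reuses the translation model, optimizer, and parallel corpus, so the overall parameter count and step budget stay unchanged; only the objective in the early steps differs.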