Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT) by drastically reducing the need for large parallel corpora. Most approaches adapt masked language modeling (MLM) to sequence-to-sequence architectures by masking parts of the input and reconstructing them in the decoder. In this work, we systematically compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context. We pretrain models with different methods on English$\leftrightarrow$German, English$\leftrightarrow$Nepali, and English$\leftrightarrow$Sinhala monolingual data, and evaluate them on NMT. In (semi-)supervised NMT, varying the pretraining objective leads to surprisingly small differences in finetuned performance, whereas unsupervised NMT is much more sensitive to it. To understand these results, we thoroughly study the pretrained models using a series of probes and verify that they encode and use information in different ways. We conclude that finetuning on parallel data is mostly sensitive to a few properties that are shared by most models, such as a strong decoder, in contrast to unsupervised NMT, which also requires models with strong cross-lingual abilities.
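As a rough illustration of the input-corruption objectives being compared, the following minimal Python sketch contrasts masking with reordering and word replacement. The function names, corruption ratios, and the use of random vocabulary samples (instead of the context-based replacement described in the abstract) are illustrative assumptions, not the paper's actual procedure.

```python
# Sketch of three input-corruption strategies for seq2seq pretraining:
# masking (MLM-style), local reordering, and word replacement.
# NOTE: the paper replaces words based on their context; random
# vocabulary sampling below is a simplified stand-in.
import random

def mask_tokens(tokens, ratio=0.35, mask_token="<mask>"):
    """Replace a random subset of tokens with a mask symbol."""
    out = list(tokens)
    for i in random.sample(range(len(out)), k=max(1, int(ratio * len(out)))):
        out[i] = mask_token
    return out

def shuffle_tokens(tokens, max_offset=3):
    """Locally reorder tokens so the input still looks like a full sentence."""
    keys = [i + random.uniform(0, max_offset) for i in range(len(tokens))]
    return [tok for _, tok in sorted(zip(keys, tokens))]

def replace_tokens(tokens, vocab, ratio=0.35):
    """Swap a random subset of tokens for other words (placeholder for
    context-based replacement)."""
    out = list(tokens)
    for i in random.sample(range(len(out)), k=max(1, int(ratio * len(out)))):
        out[i] = random.choice(vocab)
    return out

if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    vocab = ["cat", "runs", "slow", "red", "under", "a"]
    print(mask_tokens(sentence))            # masked input, reconstructed by the decoder
    print(shuffle_tokens(sentence))         # reordered, full-sentence-like input
    print(replace_tokens(sentence, vocab))  # word-replaced, full-sentence-like input
```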