Existing work in multilingual pretraining has demonstrated the potential of cross-lingual transferability by training a unified Transformer encoder for multiple languages. However, much of this work relies only on a shared vocabulary and bilingual contexts to encourage correlation across languages, which is a loose and implicit way to align contextual representations between languages. In this paper, we plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages. This module effectively prevents the model from degenerating into predicting masked words conditioned only on the context of their own language. More importantly, when fine-tuning on downstream tasks, the cross-attention module can be plugged in or removed on demand, thus naturally benefiting a wider range of cross-lingual tasks, from language understanding to generation. As a result, the proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark, covering text classification, sequence labeling, question answering, and sentence retrieval. For cross-lingual generation tasks, it also outperforms all existing cross-lingual models and state-of-the-art Transformer variants on the WMT14 English-to-German and English-to-French translation datasets, with gains of up to 1 to 2 BLEU.
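To make the architectural idea concrete, the sketch below shows one plausible way to realize an encoder layer with a pluggable cross-attention sub-layer, assuming a standard PyTorch Transformer layout. This is an illustrative sketch only, not the authors' released implementation; the class name `PluggableEncoderLayer`, the `use_cross_attention` flag, and all dimensions are hypothetical.

```python
# Minimal sketch (assumed, not the paper's official code) of an encoder layer
# whose cross-attention sub-layer can be plugged in or skipped on demand.
import torch
import torch.nn as nn

class PluggableEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        # Standard self-attention sub-layer over the current language.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_norm = nn.LayerNorm(d_model)
        # Cross-attention sub-layer attending to the paired language, so that
        # masked-word prediction is explicitly conditioned on the other language.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_norm = nn.LayerNorm(d_model)
        # Position-wise feed-forward sub-layer.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ffn_norm = nn.LayerNorm(d_model)

    def forward(self, x, paired=None, use_cross_attention=True):
        # x:      (batch, src_len, d_model) hidden states of the current language
        # paired: (batch, tgt_len, d_model) hidden states of the aligned language, or None
        h, _ = self.self_attn(x, x, x)
        x = self.self_norm(x + h)
        # The cross-attention module is "plugged in or out" on demand:
        # used when bilingual context is available, skipped otherwise.
        if use_cross_attention and paired is not None:
            h, _ = self.cross_attn(x, paired, paired)
            x = self.cross_norm(x + h)
        h = self.ffn(x)
        return self.ffn_norm(x + h)
```

Under this reading, fine-tuning for understanding tasks could simply call the layer with `use_cross_attention=False`, recovering a plain Transformer encoder, while generation tasks keep the cross-attention path active.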