Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance in downstream cross-lingual tasks. This improvement stems from learning large amounts of monolingual and parallel corpora. Although it is generally acknowledged that parallel corpora are critical for improving model performance, existing methods are often constrained by the size of parallel corpora, especially for low-resource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representations of multiple languages with monolingual corpora, to overcome the constraint that parallel corpus size places on model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs from a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results on a variety of cross-lingual downstream tasks.
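To make the back-translation step described above concrete, the following is a minimal sketch of how pseudo-parallel sentence pairs could be built from a monolingual corpus; the `translate` callable and the `build_pseudo_parallel_pairs` helper are hypothetical placeholders for illustration and are not taken from ERNIE-M's actual implementation.

```python
"""Sketch: turning a monolingual corpus into pseudo-parallel pairs via back-translation.

Assumptions (not from the paper): `translate` is any sentence-level translation
model exposed as a callable (sentence, target_lang) -> translated sentence.
"""
from typing import Callable, List, Tuple


def build_pseudo_parallel_pairs(
    monolingual_sentences: List[str],
    translate: Callable[[str, str], str],
    target_lang: str,
) -> List[Tuple[str, str]]:
    """Pair each monolingual sentence with its machine translation,
    yielding (source, pseudo-target) pairs usable as alignment examples
    during cross-lingual pre-training."""
    pairs = []
    for sentence in monolingual_sentences:
        pseudo_target = translate(sentence, target_lang)
        pairs.append((sentence, pseudo_target))
    return pairs


if __name__ == "__main__":
    # Toy stand-in translator, for illustration only.
    toy_dictionary = {"hello world": "hallo welt"}
    translate = lambda s, lang: toy_dictionary.get(s, s)

    corpus = ["hello world"]
    print(build_pseudo_parallel_pairs(corpus, translate, target_lang="de"))
    # -> [('hello world', 'hallo welt')]
```

In this sketch the pseudo-parallel pairs would then feed whatever cross-lingual alignment objective the pre-training uses, so that alignment learning is no longer bounded by the size of genuinely parallel corpora.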