Recent studies have demonstrated the potential of cross-lingual transfer by training a unified Transformer encoder across multiple languages. Beyond the masked language model objective, existing cross-lingual pre-training works leverage sentence-level contrastive learning or plug in extra cross-attention modules to compensate for insufficient cross-lingual alignment. Nonetheless, synonym pairs residing in the bilingual corpus are left unexploited and unaligned, even though such token-level alignment matters more than establishing sentence-level interdependence for token-level tasks. In this work, we propose VECO 2.0, a cross-lingual pre-trained model based on contrastive learning with multi-granularity alignments. Specifically, a sequence-to-sequence alignment objective maximizes the similarity of parallel sentence pairs and minimizes that of non-parallel pairs. A token-to-token alignment objective is then integrated to pull synonymous tokens, mined via a thesaurus dictionary, closer to each other and away from the other, unpaired tokens in a bilingual instance. Experiments on the XTREME benchmark show the effectiveness of the proposed strategy for cross-lingual model pre-training.
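To make the two granularities of alignment concrete, below is a minimal PyTorch-style sketch of in-batch contrastive losses at the sequence and token levels. This is not the paper's actual implementation: the function names, the pooled sentence representations, the temperature value, and the index tensors of dictionary-mined synonym positions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(src_repr, tgt_repr, temperature=0.05):
    """Sequence-to-sequence alignment (sketch).

    src_repr, tgt_repr: (batch, hidden) pooled encoder outputs of the two
    sides of parallel sentence pairs; row i of src_repr is parallel to row i
    of tgt_repr, and all other rows act as in-batch negatives.
    """
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    logits = src @ tgt.t() / temperature            # (batch, batch) cosine similarities
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric InfoNCE: pull parallel pairs together, push non-parallel pairs apart.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def token_contrastive_loss(token_repr, syn_src_idx, syn_tgt_idx, temperature=0.05):
    """Token-to-token alignment over dictionary-mined synonym pairs (sketch).

    token_repr: (seq_len, hidden) contextual token states of one concatenated
    bilingual instance; syn_src_idx / syn_tgt_idx index the positions of
    matched synonym tokens on the two sides.
    """
    reps = F.normalize(token_repr, dim=-1)
    anchors = reps[syn_src_idx]                     # (num_pairs, hidden)
    logits = anchors @ reps.t() / temperature       # score against every token in the instance
    # The matched synonym position is the positive; all other tokens are negatives.
    return F.cross_entropy(logits, syn_tgt_idx)
```

In both losses the parallel counterpart serves as the positive while the remaining sentences in the batch (or tokens in the instance) serve as negatives, mirroring the maximize/minimize formulation described above.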