Cross-lingual representations have the potential to make NLP techniques available to the vast majority of the world's languages. However, they currently require large pretraining corpora or access to typologically similar languages. In this work, we address these obstacles by removing language identity signals from multilingual embeddings. We examine three approaches: (i) re-aligning the vector spaces of the target languages (all together) to a pivot source language; (ii) removing language-specific means and variances, which as a by-product also makes the embeddings more discriminative; and (iii) increasing input similarity across languages by removing morphological contractions and by reordering sentences. We evaluate on XNLI and reference-free MT evaluation across 19 typologically diverse languages. Our findings expose the limitations of these approaches: unlike vector normalization, vector space re-alignment and text normalization do not achieve consistent gains across encoders and languages. However, because the effects of the approaches are additive, their combination decreases the cross-lingual transfer gap by 8.9 points (m-BERT) and 18.2 points (XLM-R) on average across all tasks and languages. Our code and models are publicly available.
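To make approach (ii) concrete, the sketch below shows one plausible way to remove language-specific means and variances from encoder outputs. It is a minimal illustration, not the paper's implementation: the function name, the per-language dictionary layout, and the toy random data standing in for m-BERT/XLM-R embeddings are all assumptions made for this example.

```python
import numpy as np

def normalize_per_language(embeddings_by_lang):
    """Remove language-specific means and variances from sentence embeddings.

    embeddings_by_lang: dict mapping a language code to an array of shape
    (num_sentences, dim), e.g. pooled outputs of a multilingual encoder.
    Returns a dict with the same keys, where each language's embeddings are
    centered and scaled by that language's own statistics.
    """
    normalized = {}
    for lang, emb in embeddings_by_lang.items():
        mu = emb.mean(axis=0, keepdims=True)             # language-specific mean
        sigma = emb.std(axis=0, keepdims=True) + 1e-12   # language-specific std (avoid /0)
        normalized[lang] = (emb - mu) / sigma
    return normalized

# Toy usage: random vectors with different per-language statistics stand in
# for real encoder outputs; after normalization both languages share
# (approximately) zero mean and unit variance per dimension.
rng = np.random.default_rng(0)
fake = {"en": rng.normal(0.0, 1.0, (5, 8)), "de": rng.normal(0.5, 2.0, (5, 8))}
out = normalize_per_language(fake)
print({k: (float(v.mean().round(3)), float(v.std().round(3))) for k, v in out.items()})
```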