The Transformer has demonstrated great power in learning contextual word representations for multiple languages in a single model. To process multilingual sentences, a learnable vector called a "language embedding" is usually assigned to each language. The language embedding is either added to the word embedding or attached at the beginning of the sentence, and it serves as a language-specific signal that helps the Transformer capture contextual representations across languages. In this paper, we revisit the use of language embedding and identify several problems in the existing formulations. By investigating the interaction between language embedding and word embedding in the self-attention module, we find that the current methods cannot reflect language-specific word correlation well. Given these findings, we propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding. For a given sentence, XLP projects the word embeddings into a language-specific semantic space; the projected embeddings are then fed into the Transformer model, which processes them with their language-specific meanings. In this way, XLP appropriately encodes "language" in a multilingual Transformer model. Experimental results show that XLP can freely and significantly boost model performance on extensive multilingual benchmark datasets. Code and models will be released at https://github.com/lsj2408/XLP.
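The contrast between an additive language embedding and a language-specific projection can be illustrated with a minimal NumPy sketch. The abstract does not state the exact form of the projection, so the per-language linear map used here is an assumption, as are all names (`word_emb`, `lang_emb`, `lang_proj`, `embed_additive`, `embed_projected`), the sizes, and the random initialization; this is not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 8          # illustrative sizes (assumed)
languages = ["en", "zh"]              # illustrative language set (assumed)

# Shared word embedding table, as in a multilingual Transformer.
word_emb = rng.normal(size=(vocab_size, d_model))

# Baseline: one learnable vector per language, added to every word embedding.
lang_emb = {lang: rng.normal(size=d_model) for lang in languages}

# Projection variant (assumed linear): one d_model x d_model matrix per language.
lang_proj = {lang: rng.normal(size=(d_model, d_model)) for lang in languages}


def embed_additive(token_ids, lang):
    """Word embeddings plus a language embedding shared by all tokens."""
    return word_emb[token_ids] + lang_emb[lang]


def embed_projected(token_ids, lang):
    """Word embeddings mapped into a language-specific space."""
    return word_emb[token_ids] @ lang_proj[lang]


token_ids = np.array([3, 17, 42])
x_add = embed_additive(token_ids, "en")
x_proj = embed_projected(token_ids, "en")
```

In this sketch the additive variant shifts every token by the same vector, so the difference between any two token embeddings is identical in every language. The projected variant transforms each word embedding through a language-dependent map, so relations between words can differ by language before the embeddings reach the self-attention layers.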