In order to preserve word-order information in a non-autoregressive setting, transformer architectures tend to include positional knowledge, for instance by adding positional encodings to token embeddings. Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture; these include, for instance, separating position encodings from token embeddings, or directly modifying attention weights based on the distance between word pairs. We first show that, surprisingly, while these modifications tend to improve monolingual language models, none of them results in better multilingual language models. We then answer why that is: sinusoidal encodings were explicitly designed to facilitate compositionality by allowing linear projections over arbitrary time steps. Higher variance in multilingual training distributions requires higher compression, in which case compositionality becomes indispensable. Learned absolute positional encodings (e.g., in mBERT) tend to approximate sinusoidal embeddings in multilingual settings, but more complex positional encoding architectures lack the inductive bias to effectively learn compositionality and cross-lingual alignment. In other words, while sinusoidal positional encodings were originally designed for monolingual applications, they are particularly useful in multilingual language models.
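To make the compositionality claim concrete, the following sketch (our notation, following the original transformer formulation, not part of this abstract) recalls the sinusoidal definition and the linear-projection property it implies: a fixed offset $k$ acts on each frequency pair as a rotation that is independent of the absolute position.

\[
PE_{(pos,\,2i)} = \sin(\omega_i\, pos), \qquad PE_{(pos,\,2i+1)} = \cos(\omega_i\, pos), \qquad \omega_i = 10000^{-2i/d}
\]
\[
\begin{pmatrix} \sin\big(\omega_i (pos+k)\big) \\ \cos\big(\omega_i (pos+k)\big) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i\, pos) \\ \cos(\omega_i\, pos) \end{pmatrix}
\]

Hence $PE(pos+k) = M_k\, PE(pos)$ for a block-diagonal matrix $M_k$ that depends only on the offset $k$, so a single linear map can express a relative shift over arbitrary time steps; this is the inductive bias the abstract refers to.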