Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information. We find that a representation of a language can be obtained simply by averaging the embeddings of that language's tokens. Given this language representation, we control the output language of m-BERT by manipulating its token embeddings, thus achieving unsupervised token translation. Based on this observation, we further propose a computationally cheap but effective approach to improving the cross-lingual ability of m-BERT.
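The averaging and embedding manipulation described above can be illustrated with a minimal sketch. The abstract does not specify the exact manipulation, so the code below assumes one plausible form: subtract the source-language mean, add the target-language mean, then look up the nearest target-language token by cosine similarity. The function names (`language_mean`, `shift_language`, `nearest_tokens`), the `alpha` scaling parameter, and the toy 4-dimensional vectors are all hypothetical and stand in for real m-BERT embeddings.

```python
import numpy as np


def language_mean(embeddings: np.ndarray) -> np.ndarray:
    """Average token embeddings of shape (num_tokens, dim) into one language vector."""
    return embeddings.mean(axis=0)


def shift_language(embeddings, source_mean, target_mean, alpha=1.0):
    """Move embeddings from the source language toward the target language.

    Assumption: the manipulation is a plain mean-difference shift,
    scaled by the hypothetical parameter alpha.
    """
    return embeddings + alpha * (target_mean - source_mean)


def nearest_tokens(queries, vocab_embeddings, vocab_tokens):
    """Return, for each query vector, the vocabulary token with highest cosine similarity."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    v = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    best = (q @ v.T).argmax(axis=1)
    return [vocab_tokens[i] for i in best]


# Toy data: each embedding is a shared "semantic" part (first 3 dims)
# plus a language-specific offset (last dim). Real embeddings would
# come from m-BERT; these values are invented for illustration.
semantic = np.eye(3, 4)                      # three concepts
en_tokens = ["dog", "cat", "house"]
de_tokens = ["Hund", "Katze", "Haus"]
en_emb = semantic + np.array([0.0, 0.0, 0.0, 1.0])
de_emb = semantic + np.array([0.0, 0.0, 0.0, -1.0])

en_mean = language_mean(en_emb)
de_mean = language_mean(de_emb)

shifted = shift_language(en_emb, en_mean, de_mean)
translations = nearest_tokens(shifted, de_emb, de_tokens)
print(dict(zip(en_tokens, translations)))
```

In this toy setup the language offset is identical for every token, so the shift recovers the target-language vectors exactly. With real m-BERT embeddings the language component is only approximately constant, so nearest-neighbour lookups would be noisier.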