Multilingual pretrained language models (mPLMs) have shown their effectiveness in multilingual word alignment induction. However, these methods usually start from mBERT or XLM-R. In this paper, we investigate whether the multilingual sentence Transformer LaBSE is a strong multilingual word aligner. This idea is non-trivial, as LaBSE is trained to learn language-agnostic sentence-level embeddings, while the alignment extraction task requires the more fine-grained word-level embeddings to be language-agnostic. We demonstrate that vanilla LaBSE outperforms other mPLMs currently used for the alignment task, and then propose to finetune LaBSE on parallel corpora for further improvement. Experimental results on seven language pairs show that our best aligner outperforms previous state-of-the-art models of all varieties. In addition, our aligner supports different language pairs in a single model, and even achieves new state-of-the-art results on zero-shot language pairs that do not appear in the finetuning process.
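To make the alignment-extraction setting concrete, the following is a minimal sketch of one common way to induce word alignments from the vanilla (un-finetuned) LaBSE checkpoint on Hugging Face: take contextual word-level embeddings from an intermediate layer, compute a cosine similarity matrix between source and target words, and keep mutual best matches. The layer index, cosine similarity, and mutual-argmax heuristic are illustrative assumptions for this sketch, not necessarily the exact procedure used in the paper.

```python
# Sketch: word alignment from LaBSE word-level embeddings (assumptions noted above).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
model = AutoModel.from_pretrained("sentence-transformers/LaBSE", output_hidden_states=True)
model.eval()

def embed_words(words, layer=8):
    """Return one contextual embedding per word (mean over its subword pieces).
    The choice of layer 8 is an assumption for illustration."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]  # (num_subwords, dim)
    word_ids = enc.word_ids()
    vecs = []
    for i in range(len(words)):
        piece_idx = [j for j, w in enumerate(word_ids) if w == i]
        vecs.append(hidden[piece_idx].mean(dim=0))
    return F.normalize(torch.stack(vecs), dim=-1)  # unit-normalize for cosine similarity

def extract_alignments(src_words, tgt_words, layer=8):
    """Keep (i, j) pairs that are each other's nearest neighbor in both directions."""
    src, tgt = embed_words(src_words, layer), embed_words(tgt_words, layer)
    sim = src @ tgt.T                 # (|src|, |tgt|) cosine similarity matrix
    fwd_best = sim.argmax(dim=-1)     # best target word for each source word
    bwd_best = sim.argmax(dim=0)      # best source word for each target word
    return [(i, int(j)) for i, j in enumerate(fwd_best) if int(bwd_best[j]) == i]

# Example usage on a toy English-German sentence pair.
print(extract_alignments("we thank you .".split(), "wir danken euch .".split()))
```

The mutual-argmax intersection is a simple, parameter-free extraction rule; thresholded bidirectional softmax probabilities are another common alternative in the alignment literature.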