Existing models of multilingual sentence embeddings require large parallel data resources that are not available for low-resource languages. We propose a novel unsupervised method for deriving multilingual sentence embeddings that relies only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM) to derive the multilingual sentence representations. The quality of the representations is evaluated on two parallel corpus mining tasks, with improvements of up to 22 F1 points over vanilla XLM. In addition, we observe that a single synthetic bilingual corpus also improves results for other language pairs.
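To make the embedding step concrete, below is a minimal sketch of deriving fixed-size multilingual sentence representations from a pretrained XLM checkpoint. The checkpoint name, the use of the Hugging Face `transformers` library, and mean pooling over token states are illustrative assumptions, not the authors' exact pipeline.

```python
# A minimal sketch (assumptions, not the paper's exact method) of deriving
# sentence embeddings from a pretrained XLM checkpoint via mean pooling.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-mlm-100-1280"  # assumed public XLM checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentences):
    """Mean-pool the final hidden states over non-padding tokens."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (B, H)

# Cosine similarity between embeddings can then score candidate
# sentence pairs, e.g. for parallel corpus mining.
en = embed(["A dog is running in the snow."])
de = embed(["Ein Hund läuft im Schnee."])
score = torch.nn.functional.cosine_similarity(en, de)
```

Fine-tuning such a model on the synthetic parallel corpus, as the abstract describes, would then be a standard supervised step on top of these representations.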