Cross-lingual retrieval aims to retrieve relevant text across languages. Current methods typically achieve cross-lingual retrieval by learning language-agnostic text representations at the word or sentence level. However, how to learn phrase representations for cross-lingual phrase retrieval remains an open problem. In this paper, we propose XPR, a cross-lingual phrase retriever that extracts phrase representations from unlabeled example sentences. Moreover, we create a large-scale cross-lingual phrase retrieval dataset containing 65K bilingual phrase pairs and 4.2M example sentences across 8 English-centric language pairs. Experimental results show that XPR outperforms state-of-the-art baselines that utilize word-level or sentence-level representations. XPR also shows impressive zero-shot transferability, enabling the model to perform retrieval on language pairs unseen during training. Our dataset, code, and trained models are publicly available at www.github.com/cwszz/XPR/.