In this work we present a systematic empirical study of the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval across a number of diverse language pairs. We first treat these models as multilingual text encoders and benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR. In contrast to their strong performance in supervised language understanding, our results indicate that for unsupervised document-level CLIR -- a setup with no relevance judgments for IR-specific fine-tuning -- pretrained multilingual encoders on average fail to significantly outperform earlier models based on cross-lingual word embeddings (CLWEs). For sentence-level retrieval, we do obtain state-of-the-art performance: the peak scores, however, are achieved by multilingual encoders that have been further specialized, in a supervised fashion, for sentence understanding tasks, rather than by their vanilla 'off-the-shelf' variants. Following these results, we introduce localized relevance matching for document-level CLIR, in which we independently score a query against individual document sections. In the second part, we evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., trained to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments. Our results show that supervised re-ranking rarely improves over the performance of multilingual transformers used as unsupervised base rankers. Finally, only with in-domain contrastive fine-tuning (i.e., same domain, language transfer only) do we manage to improve the ranking quality. We uncover substantial empirical differences between cross-lingual retrieval results and results of (zero-shot) cross-lingual transfer for monolingual retrieval in target languages, which point to "monolingual overfitting" of retrieval models trained on monolingual data.
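To make the localized relevance matching idea concrete, the following is a minimal sketch of scoring a query independently against document sections, assuming a multilingual encoder from the sentence-transformers library; the model name, the naive whitespace-based section splitter, and the max-over-sections aggregation are illustrative assumptions here, not necessarily the exact configuration used in the paper.

```python
# Minimal sketch of localized relevance matching for document-level CLIR.
# Assumptions (illustrative, not the paper's exact setup): a pretrained
# multilingual sentence encoder, fixed-length sections, max aggregation.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def split_into_sections(document: str, section_len: int = 128) -> list[str]:
    """Naive fixed-length splitter over whitespace tokens (illustrative)."""
    tokens = document.split()
    return [" ".join(tokens[i:i + section_len])
            for i in range(0, len(tokens), section_len)] or [document]

def localized_score(query: str, document: str) -> float:
    """Score each section against the query independently; the document's
    relevance is the score of its best-matching section."""
    sections = split_into_sections(document)
    q = encoder.encode([query], normalize_embeddings=True)    # shape (1, d)
    s = encoder.encode(sections, normalize_embeddings=True)   # shape (n, d)
    return float(np.max(s @ q.T))  # cosine similarity of unit vectors

def rank(query: str, docs: list[str]) -> list[int]:
    """Rank a (cross-lingual) document collection by localized relevance."""
    scores = [localized_score(query, d) for d in docs]
    return sorted(range(len(docs)), key=scores.__getitem__, reverse=True)
```

Scoring sections independently sidesteps the encoder's input-length limit and keeps a long document from diluting a locally relevant passage in a single pooled representation.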