The advent of transformer-based models such as BERT has led to the rise of neural ranking models. These models have improved the effectiveness of retrieval systems well beyond that of lexical term matching models such as BM25. While monolingual retrieval tasks have benefited from large-scale training collections such as MS MARCO and advances in neural architectures, cross-language retrieval tasks have not kept pace. This paper introduces ColBERT-X, a generalization of the ColBERT multi-representation dense retrieval model that uses the XLM-RoBERTa (XLM-R) encoder to support cross-language information retrieval (CLIR). ColBERT-X can be trained in two ways. In zero-shot training, the system is trained on the English MS MARCO collection, relying on the XLM-R encoder for cross-language mappings. In translate-train, the system is trained on the MS MARCO English queries coupled with machine translations of the associated MS MARCO passages. Results on ad hoc document ranking tasks in several languages demonstrate substantial and statistically significant improvements of these trained dense retrieval models over traditional lexical CLIR baselines.
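To make the late-interaction architecture concrete, the following is a minimal sketch (not the released ColBERT-X implementation) of ColBERT-style "MaxSim" scoring over XLM-R token embeddings. The model name, the 128-dimensional projection, and the untrained projection layer are illustrative assumptions; in the actual system the encoder and projection are fine-tuned on MS MARCO in either the zero-shot or translate-train configuration.

```python
# Minimal sketch of ColBERT-style late interaction ("MaxSim") scoring
# with an XLM-RoBERTa encoder. Assumptions: "xlm-roberta-base" checkpoint,
# a 128-dim projection, and an untrained (randomly initialized) projection.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
proj = torch.nn.Linear(encoder.config.hidden_size, 128)  # token embedding dim (assumed)

def embed(text: str) -> torch.Tensor:
    """Encode text into one L2-normalized embedding per token."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state    # [1, seq_len, hidden]
    return F.normalize(proj(hidden), dim=-1)             # [1, seq_len, 128]

def maxsim_score(query: str, passage: str) -> float:
    """Late interaction: for each query token, take its maximum similarity
    over all passage tokens, then sum over query tokens."""
    q, d = embed(query), embed(passage)
    sim = q @ d.transpose(1, 2)                           # [1, q_len, d_len]
    return sim.max(dim=-1).values.sum().item()

# Example: an English query scored against a non-English passage; any
# cross-language matching comes from the multilingual XLM-R representations.
print(maxsim_score("What is dense retrieval?",
                   "La recherche dense utilise des représentations neuronales."))
```

Because the query and passage are encoded independently, passage token embeddings can be computed offline and indexed; only the query-side encoding and the MaxSim aggregation happen at query time.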