ColBERT-X is a dense retrieval model for Cross-Language Information Retrieval (CLIR). In CLIR, documents are written in one natural language, while queries are expressed in another. A related task is Multilingual IR (MLIR), in which the system creates a single ranked list of documents written in many languages. Given that ColBERT-X relies on a pretrained multilingual neural language model to rank documents, a multilingual training procedure can enable a version of ColBERT-X well-suited for MLIR. This paper describes that training procedure. An important factor for good MLIR ranking is fine-tuning XLM-R using mixed-language batches, where the same query is matched with documents in different languages within the same batch. Neural machine translations of MS MARCO passages are used to fine-tune the model.
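The mixed-language batching idea can be illustrated with a short sketch. This is a minimal, hypothetical reconstruction, not the paper's actual code: the data layout (an English MS MARCO query paired with machine-translated positive and negative passages per language), the function names, and the language set are all assumptions made for illustration.

```python
import random

# Example MLIR target languages (an assumption for illustration).
LANGUAGES = ["zh", "fa", "ru"]

def make_examples(queries, translations):
    """Pair each English query with its machine-translated
    (positive, negative) passage pair in every target language.

    queries:      {query_id: query_text}
    translations: {lang: {query_id: (pos_passage, neg_passage)}}
    """
    examples = []
    for qid, query in queries.items():
        for lang in LANGUAGES:
            pos, neg = translations[lang][qid]
            examples.append(
                {"query": query, "lang": lang, "pos": pos, "neg": neg}
            )
    return examples

def mixed_language_batches(examples, batch_size, seed=0):
    """Shuffle examples across languages before batching, so a single
    batch contains documents in different languages for the same or
    different queries, rather than one language per batch."""
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)
    for i in range(0, len(examples), batch_size):
        yield examples[i:i + batch_size]
```

The key design point is that shuffling happens across languages before batching; grouping each batch by a single language would let the model optimize ranking within one language at a time, whereas mixed batches force scores for documents in different languages onto a comparable scale.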