Providing access to information across languages has been a goal of Information Retrieval (IR) for decades. While progress has been made on Cross-Language IR (CLIR), where queries are expressed in one language and documents in another, the Multilingual IR (MLIR) task of creating a single ranked list of documents across many languages is considerably more challenging. This paper investigates whether advances in neural document translation and pretrained multilingual neural language models enable improvements in the state of the art over earlier MLIR techniques. The results show that although combining neural document translation with neural ranking yields the best Mean Average Precision (MAP), 98% of that MAP score can be achieved, with an 84% reduction in indexing time, by using a pretrained XLM-R multilingual language model to index documents in their native languages; the remaining 2% difference in effectiveness is not statistically significant. Key to achieving these results for MLIR is fine-tuning XLM-R on mixed-language batches drawn from neural translations of MS MARCO passages.
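As a concrete illustration of the mixed-language batching idea, the sketch below fine-tunes XLM-R as a dense retriever with in-batch negatives, drawing each training pair from a randomly chosen language so that every batch mixes document languages. This is a minimal sketch, not the paper's implementation: the `triples_by_lang` input, the helper names, and all hyperparameters are assumptions, and the simple contrastive objective stands in for whatever ranking loss the authors actually used.

```python
# Minimal sketch: mixed-language fine-tuning of XLM-R for MLIR.
# Assumption (not from the paper's code): MS MARCO (query, passage)
# training pairs have been machine-translated into each document
# language and stored as `triples_by_lang[lang] = [(query, passage), ...]`.

import random
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

def embed(texts):
    """Mean-pool XLM-R token embeddings into one vector per text."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=180, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state           # (B, T, H)
    mask = enc["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)         # (B, H)

def mixed_language_batch(triples_by_lang, batch_size=16):
    """Sample each pair from a randomly chosen language, so one batch
    mixes passages from all document languages."""
    langs = list(triples_by_lang)
    return [random.choice(triples_by_lang[random.choice(langs)])
            for _ in range(batch_size)]

def train_step(triples_by_lang):
    batch = mixed_language_batch(triples_by_lang)
    q_emb = embed([q for q, _ in batch])
    p_emb = embed([p for _, p in batch])
    # In-batch negatives: passage i is relevant only to query i,
    # so passages in other languages act as cross-language negatives.
    scores = q_emb @ p_emb.T
    loss = F.cross_entropy(scores, torch.arange(len(batch)))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Sampling the language per pair rather than per batch keeps the in-batch negatives multilingual, which pushes the encoder to score relevance consistently across languages instead of merely within each one.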