Cross-lingual information retrieval (CLIR) helps users find documents in languages different from their queries. This is especially important in academic search, where key research is often published in non-English languages. We present CLIRudit, a novel English-French academic retrieval dataset built from Érudit, a Canadian publishing platform. Using multilingual metadata, we pair English author-written keywords as queries with non-English abstracts as target documents, a method that can be applied to other languages and repositories. We benchmark various first-stage sparse and dense retrievers, with and without machine translation. We find that dense embeddings without translation perform nearly as well as systems using machine translation, that translating documents is generally more effective than translating queries, and that sparse retrievers with document translation remain competitive while offering greater efficiency. Along with releasing the first English-French academic retrieval dataset, we provide a reproducible benchmarking method to improve access to non-English scholarly content.
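To make the dataset construction concrete, the keyword-to-abstract pairing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record fields (`id`, `abstract_fr`, `keywords_en`), the sample records, and the lowercase normalization of keywords are all hypothetical choices made for the example. Each English keyword is treated as one query, and every document whose metadata carries that keyword is marked relevant to it.

```python
from collections import defaultdict

# Hypothetical metadata records. The field names and contents are assumptions
# made for illustration; they are not taken from Érudit's actual schema.
records = [
    {"id": "doc1",
     "abstract_fr": "Résumé sur la politique linguistique au Québec.",
     "keywords_en": ["language policy", "Quebec"]},
    {"id": "doc2",
     "abstract_fr": "Étude de la littérature acadienne contemporaine.",
     "keywords_en": ["Acadian literature"]},
    {"id": "doc3",
     "abstract_fr": "Analyse des politiques linguistiques en éducation.",
     "keywords_en": ["Language Policy", "education"]},
]


def build_collection(records):
    """Build a corpus of non-English abstracts and query-to-document judgments.

    Returns (corpus, qrels) where corpus maps document id to abstract text and
    qrels maps each normalized English keyword to the set of relevant doc ids.
    """
    corpus = {}
    qrels = defaultdict(set)
    for rec in records:
        abstract = rec.get("abstract_fr")
        if not abstract:
            # Skip records that have no target-language abstract.
            continue
        corpus[rec["id"]] = abstract
        for keyword in rec.get("keywords_en", []):
            # Assumed normalization: trim whitespace and lowercase, so that
            # "Language Policy" and "language policy" become the same query.
            query = keyword.strip().lower()
            if query:
                qrels[query].add(rec["id"])
    return corpus, dict(qrels)


corpus, qrels = build_collection(records)
for query, doc_ids in sorted(qrels.items()):
    print(query, sorted(doc_ids))
```

Because the relevance judgments come from author-supplied metadata rather than manual annotation, the same procedure could in principle be pointed at any repository that exposes keywords in one language and abstracts in another.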