Despite recent advances in Multilingual Information Retrieval (MLIR), a significant gap remains between research and practical deployment. Many studies assess MLIR performance in isolated settings, which limits their applicability to real-world scenarios. In this work, we leverage the unique characteristics of the Quranic multilingual corpus to examine which strategies are best suited to developing an ad-hoc IR system for the Islamic domain that satisfies users' information needs in multiple languages. We prepared eleven retrieval models using four training approaches: monolingual, cross-lingual, translate-train-all, and a novel mixed method that combines cross-lingual and monolingual techniques. Evaluation on an in-domain dataset demonstrates that the mixed approach achieves promising results across diverse retrieval scenarios. Furthermore, we provide a detailed analysis of how different training configurations affect the embedding space and what this implies for multilingual retrieval effectiveness. Finally, we discuss deployment considerations, emphasizing the cost-efficiency of deploying a single versatile, lightweight model for real-world MLIR applications.