In this paper, we present a novel series of Russian information retrieval datasets constructed from the "Did you know..." section of Russian Wikipedia. Our datasets support a range of retrieval tasks, including fact-checking, retrieval-augmented generation, and full-document retrieval, by leveraging interesting facts and their referenced Wikipedia articles, annotated at the sentence level with graded relevance. We describe a dataset-creation methodology that enables the expansion of existing Russian Information Retrieval (IR) resources. Through extensive experiments, we extend the RusBEIR research by comparing lexical retrieval models, such as BM25, with state-of-the-art neural architectures fine-tuned for Russian, as well as with multilingual models. Our results show that lexical methods tend to outperform neural models on full-document retrieval, while neural approaches better capture the semantics of shorter texts, as in fact-checking or fine-grained retrieval. Using our newly created datasets, we also analyze the impact of document length on retrieval performance and demonstrate that combining retrieval with neural reranking consistently improves results. Our contribution expands the resources available for Russian information retrieval research and highlights the importance of accurate evaluation of retrieval models to achieve optimal performance. All datasets are publicly available on HuggingFace. To facilitate reproducibility and future research, we also release the full implementation on GitHub.
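As a rough illustration of the lexical baseline referenced above, the Okapi BM25 scoring used by models such as BM25 can be sketched in pure Python. This is a hypothetical helper for exposition, not the paper's implementation; the whitespace tokenization and the parameter values k1=1.5, b=0.75 are assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Uses naive whitespace tokenization (an assumption for illustration);
    real Russian IR pipelines would add lemmatization or stemming.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF, as in the Okapi BM25 formulation.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation with length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "dogs bark at night",
    "a cat chased the dog",
]
print(bm25_scores("cat", docs))  # documents without "cat" score 0.0
```

In a full pipeline of the kind evaluated in the paper, such lexical scores would produce a candidate list that a neural reranker then reorders.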