Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations

Pyserini is an easy-to-use Python toolkit that supports replicable IR research by providing effective first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. We also describe how our group has built a culture of replicability through shared norms and tools that enable rigorous automated testing.
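To make the three retrieval modes named in the abstract concrete, the following is a minimal, self-contained sketch of sparse (BM25), dense (inner-product), and hybrid scoring over a toy corpus. It does not use or reproduce Pyserini's API. The function names, the toy document vectors, the BM25 parameter values, and the weighted-sum fusion rule are all assumptions chosen for illustration; a real system would obtain dense vectors from a transformer encoder and would search an index rather than score every document.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Sparse retrieval: score each document with a basic BM25 formula.

    k1 and b are illustrative values, not tuned settings.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def dense_scores(query_vec, doc_vecs):
    """Dense retrieval: inner product between query and document vectors."""
    return [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]


def hybrid_scores(sparse, dense, alpha=0.5):
    """Hybrid retrieval: weighted sum of the two score lists (an assumed rule)."""
    return [d + alpha * s for s, d in zip(sparse, dense)]


def rank(scores):
    """Return document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy corpus and hand-made 2-d vectors standing in for encoder output.
docs = [
    "bm25 ranks documents with term statistics",
    "dense retrieval encodes text with transformers",
    "hybrid retrieval combines sparse and dense scores",
]
doc_vecs = [[0.1, 0.9], [0.9, 0.1], [0.6, 0.4]]
query, query_vec = "dense retrieval", [1.0, 0.0]

sparse = bm25_scores(query, docs)
dense = dense_scores(query_vec, doc_vecs)
hybrid = hybrid_scores(sparse, dense, alpha=0.5)
print("hybrid ranking:", rank(hybrid))
```

In this sketch, `alpha` controls how much the sparse score contributes relative to the dense score. Because BM25 scores and inner products live on different scales, the weight would normally be tuned on held-out queries.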

