Managing the data for Information Retrieval (IR) experiments can be challenging. Dataset documentation is scattered across the Internet, and once a copy of the data is obtained, there are numerous data formats to work with. Even basic formats can have subtle dataset-specific nuances that must be considered for proper use. To help mitigate these challenges, we introduce a new robust and lightweight tool (ir_datasets) for acquiring, managing, and performing typical operations over datasets used in IR. We primarily focus on textual datasets used for ad-hoc search. The tool provides both a Python and a command line interface to numerous IR datasets and benchmarks. To our knowledge, this is the most extensive tool of its kind. Integrations with popular IR indexing and experimentation toolkits demonstrate the tool's utility. We also provide documentation of these datasets through the ir_datasets catalog: https://ir-datasets.com/. The catalog acts as a hub for information on datasets used in IR, providing core information about what data each benchmark provides as well as links to more detailed information. We welcome community contributions and intend to continue to maintain and grow this tool.
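
To illustrate the kind of format handling described above, the following is a minimal sketch of parsing relevance judgments in the common four-column TREC qrels layout (query ID, iteration, document ID, relevance). It is not the ir_datasets implementation: the `Qrel` type, the `parse_trec_qrels` helper, and the sample IDs are hypothetical names introduced only for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class Qrel(NamedTuple):
    """One relevance judgment (hypothetical type for this sketch)."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def parse_trec_qrels(lines: Iterable[str]) -> Iterator[Qrel]:
    """Yield judgments from lines of 'query_id iteration doc_id relevance'."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, which occur in some files
        fields = line.split()  # fields may be separated by spaces or tabs
        if len(fields) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 fields, got {len(fields)}"
            )
        query_id, iteration, doc_id, relevance = fields
        # IDs stay strings: converting them to int could drop leading zeros.
        yield Qrel(query_id, doc_id, int(relevance), iteration)


# Made-up sample data, not taken from any real benchmark.
sample = [
    "301 0 DOC-A 1",
    "301\t0\tDOC-B\t0",
    "",
    "302 Q0 DOC-C 2",
]
qrels = list(parse_trec_qrels(sample))
```

Even this simple layout has the nuances the abstract mentions: separators vary between spaces and tabs, the iteration column is usually ignored, and the meaning of the relevance grades differs between datasets. A tool such as ir_datasets is meant to hide this per-dataset handling behind a single interface.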