Simplified Data Wrangling with ir_datasets

Managing the data for Information Retrieval (IR) experiments can be challenging. Dataset documentation is scattered across the Internet, and once one obtains a copy of the data, there are numerous different data formats to work with. Even basic formats can have subtle dataset-specific nuances that need to be considered for proper use. To help mitigate these challenges, we introduce a new robust and lightweight tool (ir_datasets) for acquiring, managing, and performing typical operations over datasets used in IR. We primarily focus on textual datasets used for ad-hoc search. This tool provides both a Python and a command-line interface to numerous IR datasets and benchmarks. To our knowledge, this is the most extensive tool of its kind. Integrations with popular IR indexing and experimentation toolkits demonstrate the tool's utility. We also provide documentation of these datasets through the ir_datasets catalog: https://ir-datasets.com/. The catalog acts as a hub for information on datasets used in IR, providing core information about what data each benchmark provides as well as links to more detailed information. We welcome community contributions and intend to continue to maintain and grow this tool.
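As a rough illustration of the Python interface mentioned above, the sketch below previews the first few documents, queries, and relevance judgments of a dataset. It relies on the `docs_iter()`, `queries_iter()`, and `qrels_iter()` iterators of an ir_datasets dataset object. The helper names `preview` and `preview_by_id` are hypothetical, and the field names (`doc_id`, `text`, `query_id`, `relevance`) are assumptions that hold for simple collections; other datasets may expose different fields.

```python
from itertools import islice


def preview(dataset, limit=3):
    """Collect the first `limit` documents, queries, and relevance judgments.

    `dataset` is assumed to behave like an ir_datasets dataset object,
    i.e. to provide docs_iter(), queries_iter(), and qrels_iter().
    """
    # Each iterator yields lightweight record objects; only the fields
    # assumed in the lead-in are read here.
    docs = [(d.doc_id, d.text[:80]) for d in islice(dataset.docs_iter(), limit)]
    queries = [(q.query_id, q.text) for q in islice(dataset.queries_iter(), limit)]
    qrels = [
        (r.query_id, r.doc_id, r.relevance)
        for r in islice(dataset.qrels_iter(), limit)
    ]
    return {"docs": docs, "queries": queries, "qrels": qrels}


def preview_by_id(dataset_id, limit=3):
    """Load a dataset by its ir_datasets identifier and preview it."""
    # Third-party package (pip install ir_datasets), imported lazily so the
    # preview() helper above can be used without it.
    import ir_datasets

    return preview(ir_datasets.load(dataset_id), limit)
```

For example, `preview_by_id("vaswani")` should return a handful of records from the small Vaswani collection, assuming the package is installed and the data can be downloaded on first use. Keeping `preview` separate from the loading step means the same helper can be applied to any object that offers the three iterators.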
