Neural information retrieval (IR) systems have progressed rapidly in recent years, in large part due to the release of publicly available benchmarking tasks. Unfortunately, some dimensions of this progress are illusory: the majority of popular IR benchmarks today focus exclusively on downstream task accuracy and thus conceal the costs incurred by systems that trade away efficiency for quality. Latency, hardware cost, and other efficiency considerations are paramount to the deployment of IR systems in user-facing settings. We propose that IR benchmarks structure their evaluation methodology to include not only metrics of accuracy, but also efficiency considerations such as query latency and the corresponding cost budget for a reproducible hardware setting. For the popular IR benchmarks MS MARCO and XOR-TyDi, we show how the best choice of IR system varies according to how these efficiency considerations are chosen and weighed. We hope that future benchmarks will adopt these guidelines toward more holistic IR evaluation.
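As a concrete illustration of the proposed reporting scheme, the sketch below shows one way a benchmark harness might record efficiency alongside accuracy: it times each query, summarizes mean and 95th-percentile latency, and converts the mean latency into a per-query dollar cost under a fixed hourly price for a reproducible hardware setting. This is a minimal sketch under those assumptions, not an implementation from any existing benchmark; `search_fn`, `queries`, and `hourly_hardware_cost_usd` are hypothetical names.

```python
import statistics
import time

def profile_queries(search_fn, queries, hourly_hardware_cost_usd):
    """Measure per-query latency for `search_fn` and translate it into a
    dollar cost per query under a fixed hourly hardware price.

    `search_fn` is any callable mapping a query string to ranked results
    (hypothetical interface); `hourly_hardware_cost_usd` is the on-demand
    price of the reproducible hardware setting fixed by the benchmark.
    """
    latencies = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        latencies.append(time.perf_counter() - start)

    latencies.sort()
    mean_s = statistics.mean(latencies)
    # 95th-percentile latency via the nearest-rank index into the sorted list.
    p95_s = latencies[int(0.95 * (len(latencies) - 1))]
    # Cost per query: mean seconds per query times the per-second hardware price.
    cost_per_query_usd = mean_s * hourly_hardware_cost_usd / 3600.0
    return {
        "mean_latency_s": mean_s,
        "p95_latency_s": p95_s,
        "cost_per_query_usd": cost_per_query_usd,
    }
```

Reporting these numbers next to the usual accuracy metrics would let a benchmark leaderboard surface, rather than conceal, the efficiency costs the abstract describes.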