Recently, several dense retrieval (DR) models have demonstrated performance competitive with the term-based retrieval methods that are ubiquitous in search systems. In contrast to term-based matching, DR projects queries and documents into a dense vector space and retrieves results via (approximate) nearest neighbor search. Deploying a new system, such as DR, inevitably involves tradeoffs in aspects of its performance. Established retrieval systems running at scale are usually well understood in terms of effectiveness and costs, such as query latency, indexing throughput, or storage requirements. In this work, we propose a framework with a set of criteria that go beyond simple effectiveness measures to thoroughly compare two retrieval systems, with the explicit goal of assessing the readiness of one system to replace the other. This includes careful consideration of the tradeoffs between effectiveness and various cost factors. Furthermore, we describe guardrail criteria, since even a system that is better on average may fail systematically on a minority of queries. The guardrails check for failures tied to certain query characteristics and for novel failure types that are only possible in dense retrieval systems. We demonstrate our decision framework on a Web ranking scenario. In that scenario, state-of-the-art DR models achieve surprisingly strong results: they not only perform well on average but also pass an extensive set of guardrail tests, showing robustness with respect to different query characteristics, lexical matching, generalization, and number of regressions. It is impossible to predict whether DR will become ubiquitous in the future, but one way this could happen is through repeated applications of decision processes such as the one presented here.
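To make the guardrail idea concrete, the following is a minimal sketch of a readiness check that combines an average-effectiveness comparison with a regression guardrail. It is not the framework from this work: the function name `ready_to_replace`, the per-query score dictionaries, and the thresholds `min_gain`, `regression_tol`, and `max_regression_rate` are all assumptions chosen for illustration.

```python
from statistics import mean


def ready_to_replace(baseline, candidate, min_gain=0.0,
                     regression_tol=0.1, max_regression_rate=0.05):
    """Hypothetical readiness check (illustrative, not the paper's method).

    baseline, candidate: dicts mapping query id -> per-query effectiveness
    score (for example NDCG@10) for the established and the new system.
    All thresholds are assumed values.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no shared queries to compare")

    # Per-query difference: positive means the candidate is better.
    deltas = [candidate[q] - baseline[q] for q in shared]
    avg_gain = mean(deltas)

    # A regression is a query where the candidate is worse by more than
    # the tolerance; the guardrail limits the share of such queries.
    regressed = [q for q, d in zip(shared, deltas) if d < -regression_tol]
    regression_rate = len(regressed) / len(shared)

    return {
        "avg_gain": avg_gain,
        "regression_rate": regression_rate,
        "regressed_queries": regressed,
        "passes": avg_gain > min_gain
        and regression_rate <= max_regression_rate,
    }


# Toy data: the candidate wins on average but regresses badly on q4.
baseline = {"q1": 0.5, "q2": 0.5, "q3": 0.5, "q4": 0.5}
candidate = {"q1": 0.9, "q2": 0.9, "q3": 0.9, "q4": 0.1}
result = ready_to_replace(baseline, candidate)
print(result["passes"], result["regressed_queries"])
```

In this toy example the candidate improves the mean score yet fails the check, because one of four queries regresses beyond the tolerance. This is the situation the guardrail criteria are meant to catch.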