A retrieval model should not only interpolate the training data but also extrapolate well to queries that differ from the training data. While neural retrieval models have demonstrated impressive performance on ad-hoc search benchmarks, we still know little about how they perform in terms of interpolation and extrapolation. In this paper, we demonstrate the importance of evaluating these two capabilities separately. First, we examine existing ad-hoc search benchmarks from the two perspectives. We investigate the distributions of training and test data and find considerable overlap in query entities, query intents, and relevance labels. This finding implies that evaluation on these test sets is biased toward interpolation and cannot accurately reflect extrapolation capacity. Second, we propose a novel evaluation protocol to separately evaluate interpolation and extrapolation performance on existing benchmark datasets. It resamples the training and test data based on query similarity and uses the resampled dataset for training and evaluation. Finally, we leverage the proposed protocol to comprehensively revisit a number of widely adopted neural retrieval models. Results show that models perform differently when moving from interpolation to extrapolation. For example, representation-based retrieval models perform almost as well as interaction-based retrieval models in terms of interpolation but not extrapolation. Therefore, it is necessary to evaluate interpolation and extrapolation performance separately, and the proposed resampling method serves as a simple yet effective evaluation tool for future IR studies.
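To make the idea of resampling by query similarity concrete, the following is a minimal sketch and not the protocol's actual implementation. It assumes each query already has an embedding vector, uses cosine similarity as the similarity measure, and applies a hypothetical cutoff `threshold`; the function name `resample_training_queries`, the toy vectors, and the threshold value are all illustrative assumptions. Training queries close to some test query form an interpolation-oriented training set, and the remaining queries form an extrapolation-oriented one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def resample_training_queries(train_vecs, test_vecs, threshold=0.8):
    """Split training query ids by their maximum similarity to any test query.

    train_vecs, test_vecs: dicts mapping query id -> embedding (assumed given,
    and test_vecs is assumed non-empty).
    threshold: hypothetical similarity cutoff, chosen here only for illustration.
    Returns (similar_ids, dissimilar_ids).
    """
    similar, dissimilar = [], []
    for qid, vec in train_vecs.items():
        max_sim = max(cosine(vec, t) for t in test_vecs.values())
        if max_sim >= threshold:
            similar.append(qid)      # overlaps with test queries: interpolation setting
        else:
            dissimilar.append(qid)   # far from all test queries: extrapolation setting
    return similar, dissimilar


# Toy embeddings, invented for this example only.
train = {"t1": [1.0, 0.0], "t2": [0.9, 0.1], "t3": [0.0, 1.0]}
test = {"q1": [1.0, 0.05]}
interp_ids, extrap_ids = resample_training_queries(train, test)
```

Under these assumptions, a model trained on `interp_ids` and evaluated on the test queries would be measured mainly on interpolation, while a model trained on `extrap_ids` would be measured mainly on extrapolation. For a fair comparison, the two training sets would presumably need to be matched in size.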