Few-shot dense retrieval (DR) aims to generalize effectively to novel search scenarios from only a few examples. Despite its importance, there has been little work on specialized datasets and standardized evaluation protocols. As a result, current methods often resort to random sampling from supervised datasets to create "few-data" setups and employ inconsistent training strategies during evaluation, making it difficult to compare recent progress accurately. In this paper, we propose a customized FewDR dataset and a unified evaluation benchmark. Specifically, FewDR employs class-wise sampling to establish a standardized "few-shot" setting with finely defined classes, reducing variability across multiple sampling rounds. Moreover, the dataset is partitioned into disjoint base and novel classes, allowing DR models to be continuously trained on ample data from the base classes and a few samples from the novel classes. This benchmark eliminates the risk of novel-class leakage, providing a reliable estimate of a DR model's few-shot ability. Our extensive empirical results reveal that current state-of-the-art DR models still face challenges in this standard few-shot scenario. Our code and data will be open-sourced at https://github.com/OpenMatch/ANCE-Tele.
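The class-wise sampling and disjoint base/novel split described above can be sketched as follows; this is a minimal illustration, not the paper's actual pipeline, and the function name, example schema, and `k_shot` parameter are assumptions for exposition.

```python
import random
from collections import defaultdict

def build_few_shot_split(examples, base_classes, novel_classes, k_shot, seed=0):
    """Hypothetical sketch of class-wise few-shot sampling:
    keep all data for base classes, but only k_shot examples per novel class."""
    # Disjoint classes guard against novel-class leakage into base training.
    assert not set(base_classes) & set(novel_classes), "classes must be disjoint"
    rng = random.Random(seed)  # fixed seed reduces variance across sampling rounds
    by_class = defaultdict(list)
    for ex in examples:  # each example is assumed to carry a "class" label
        by_class[ex["class"]].append(ex)
    base = [ex for c in base_classes for ex in by_class[c]]  # ample base data
    novel = []
    for c in novel_classes:  # sample exactly k_shot per novel class
        pool = by_class[c]
        novel.extend(rng.sample(pool, min(k_shot, len(pool))))
    return base, novel
```

Sampling per class (rather than uniformly over the whole dataset) is what makes the "few-shot" setting standardized: every novel class contributes the same small budget, so repeated sampling rounds produce comparable training sets.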