Long document retrieval aims to fetch query-relevant documents from a large-scale collection, where knowledge distillation has become the de facto approach to improving a retriever by mimicking a heterogeneous yet powerful cross-encoder. However, in contrast to passages or sentences, retrieval over long documents suffers from the scope hypothesis: a long document may cover multiple topics. This maximizes structural heterogeneity across documents and poses a granularity-mismatch issue, leading to inferior distillation efficacy. In this work, we propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers. While preserving the conventional dense retrieval paradigm, it first produces globally consistent representations across different fine granularities and then applies multi-granular aligned distillation only during training. In experiments, we evaluate our framework on two long-document retrieval benchmarks, on which it achieves state-of-the-art performance.
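To make the distillation objective concrete, the sketch below shows one plausible form of multi-granular aligned distillation: a listwise KL objective that matches the student's score distribution over candidate documents to the teacher's at each granularity. This is a minimal illustration under our own assumptions, not the paper's released implementation; names such as `listwise_kd_loss`, `student_doc`, `student_psg`, and `temperature` are hypothetical.

```python
# Minimal sketch of multi-granular aligned distillation (assumed form):
# a bi-encoder student scores candidates at several granularities, and
# each granularity is aligned to a cross-encoder teacher via listwise KL.
import torch
import torch.nn.functional as F

def listwise_kd_loss(student_scores, teacher_scores, temperature=1.0):
    """KL divergence between softmax-normalized score distributions
    over one query's candidate documents (listwise distillation)."""
    s = F.log_softmax(student_scores / temperature, dim=-1)
    t = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy example: one query with 4 candidate long documents, scored by the
# student at two granularities (e.g., document-level and passage-level
# pooled embeddings); the teacher is a cross-encoder.
teacher = torch.tensor([[3.1, 0.2, -1.0, 0.5]])
student_doc = torch.tensor([[2.0, 0.1, -0.5, 0.3]], requires_grad=True)
student_psg = torch.tensor([[2.5, 0.0, -0.8, 0.4]], requires_grad=True)

# Align the student with the teacher at every granularity and sum the
# per-granularity losses; this term is used only during training.
loss = listwise_kd_loss(student_doc, teacher) + listwise_kd_loss(student_psg, teacher)
loss.backward()
print(loss.item())
```

At inference time only the standard dense-retrieval scoring path would remain, consistent with the abstract's claim that the aligned distillation applies merely during training.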