Retrieval with extremely long queries and documents is a well-known and challenging task in information retrieval, commonly known as Query-by-Document (QBD) retrieval. Transformer models specifically designed to handle long input sequences have not shown high effectiveness on QBD tasks in previous work. We propose a Re-ranker based on the novel Proportional Relevance Score (RPRS) to compute the relevance score between a query and the top-k candidate documents. Our extensive evaluation shows that RPRS obtains significantly better results than state-of-the-art models on five different datasets. Furthermore, RPRS is highly efficient, since all documents can be pre-processed, embedded, and indexed before query time, giving our re-ranker a complexity of O(N), where N is the total number of sentences in the query and candidate documents. Moreover, our method addresses the problem of low-resource training in QBD retrieval tasks: it does not need large amounts of training data and has only three parameters with a limited range, which can be optimized with a grid search even when only a small amount of labeled data is available. Our detailed analysis shows that RPRS benefits from covering the full length of candidate documents and queries.