Early Stage Sparse Retrieval with Entity Linking

Traditional sparse retrievers are attractive in low-resource settings, but they rely on exact matching between high-dimensional bag-of-words (BoW) representations of the queries and the collection. Their retrieval performance is therefore limited by semantic discrepancies and vocabulary gaps. Transformer-based dense retrievers, in contrast, bring significant improvements on information retrieval tasks by exploiting low-dimensional contextualized representations of the corpus. Although dense retrievers are known for their relative effectiveness, they are less efficient and generalize less well than sparse retrievers. For a lightweight retrieval task, their high computational cost and time consumption are major barriers that encourage abandoning dense models despite the potential gains. In this work, we propose boosting the performance of sparse retrievers by expanding both the queries and the documents with linked entities, using two formats for the entity names: 1) explicit and 2) hashed. We employ a zero-shot, end-to-end dense entity linking system for entity recognition and disambiguation to augment the corpus. By leveraging advanced entity linking methods, we believe that the effectiveness gap between sparse and dense retrievers can be narrowed. We conduct our experiments on the MS MARCO passage dataset. Since we are concerned with early stage retrieval in the cascaded ranking architectures of large information retrieval systems, we evaluate our results using recall@1000. Our approach is also capable of retrieving documents for query subsets judged to be particularly difficult in prior work. We further demonstrate that the non-expanded runs and the runs expanded with explicit and hashed entities retrieve complementary results. Consequently, we adopt a run fusion approach to maximize the benefits of entity linking.
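
The expansion and fusion steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify how entity names are hashed or how runs are fused, so the MD5-based token in `hash_entity` and the use of reciprocal rank fusion are assumptions made here for illustration, and all function names are hypothetical. The entity linker itself is not shown; the sketch assumes its output is already available as a list of entity names per text.

```python
import hashlib
from collections import defaultdict


def expand_explicit(text, entities):
    """Append the linked entity names verbatim to the query or passage."""
    if not entities:
        return text
    return text + " " + " ".join(entities)


def hash_entity(name):
    """Map an entity name to one opaque alphanumeric token.

    Assumption: a truncated MD5 digest of the lowercased name. The idea is
    that a multi-word name becomes a single BoW term that a tokenizer will
    not split into common words.
    """
    digest = hashlib.md5(name.lower().encode("utf-8")).hexdigest()
    return "ent" + digest[:12]


def expand_hashed(text, entities):
    """Append one hashed token per linked entity."""
    if not entities:
        return text
    return text + " " + " ".join(hash_entity(e) for e in entities)


def reciprocal_rank_fusion(runs, k=60):
    """Fuse ranked lists of document ids (assumed fusion method: RRF)."""
    scores = defaultdict(float)
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


def recall_at_k(ranked, relevant, k=1000):
    """Fraction of the relevant documents found in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


if __name__ == "__main__":
    query = "who founded apple"
    linked = ["Apple Inc."]  # hypothetical entity linker output
    print(expand_explicit(query, linked))
    print(expand_hashed(query, linked))
    fused = reciprocal_rank_fusion([["d1", "d2"], ["d2", "d3"]])
    print(fused, recall_at_k(fused, {"d1", "d4"}, k=2))
```

In this sketch the same expansion is applied to queries and passages before indexing, so an exact-match retriever such as BM25 can match on the added entity terms. The non-expanded, explicit, and hashed runs are then fused into a single ranking.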
