Learned sparse retrieval (LSR) is a family of first-stage retrieval methods that are trained to generate sparse lexical representations of queries and documents for use with an inverted index. Many LSR methods have been introduced recently, with Splade models achieving state-of-the-art performance on MSMarco. Despite similarities in their model architectures, the methods show substantial differences in effectiveness and efficiency, and differences in the experimental setups and configurations used make it difficult to compare the methods and derive insights. In this work, we analyze existing LSR methods and identify key components to establish an LSR framework that unifies all LSR methods under the same perspective. We then reproduce all prominent methods using a common codebase and re-train them in the same environment, which allows us to quantify how components of the framework affect effectiveness and efficiency. We find that (1) including document term weighting is most important for a method's effectiveness, (2) including query weighting has a small positive impact, and (3) document expansion and query expansion have a cancellation effect. As a result, we show how removing query expansion from a state-of-the-art model can significantly reduce latency while maintaining effectiveness on the MSMarco and TripClick benchmarks. Our code is publicly available at https://github.com/thongnt99/learned-sparse-retrieval
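To make the core mechanism concrete, the following is a minimal sketch (not the authors' released implementation) of how LSR retrieval works under the framing above: an LSR encoder maps each query and document to a sparse vector of vocabulary term weights, relevance is the dot product of those vectors, and an inverted index ensures only documents sharing at least one query term are scored. The example term weights and the `retrieve` helper are illustrative assumptions, not the output of any particular model.

```python
# Minimal sketch of sparse lexical retrieval with an inverted index.
# In a real LSR system, the term weights below would be produced by a
# trained encoder (possibly including expansion terms that do not occur
# in the original text); here they are hand-picked for illustration.
from collections import defaultdict

def build_inverted_index(docs: dict[str, dict[str, float]]):
    """Map each term to the (doc_id, weight) postings that contain it."""
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            index[term].append((doc_id, weight))
    return index

def retrieve(index, query_weights: dict[str, float], k: int = 10):
    """Score via sparse dot product, touching only matching postings."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: -item[1])[:k]

# Hypothetical encoder outputs for two documents and one query.
docs = {
    "d1": {"sparse": 1.2, "retrieval": 0.9, "index": 0.4},
    "d2": {"dense": 1.1, "retrieval": 0.7},
}
index = build_inverted_index(docs)
print(retrieve(index, {"sparse": 1.0, "retrieval": 0.5}))
```

In this picture, the framework components discussed above map onto the sketch directly: document/query term weighting determines the weight values, and document/query expansion determines which terms receive nonzero weight at all.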