Learned sparse retrieval (LSR) is a family of first-stage retrieval methods that are trained to generate sparse lexical representations of queries and documents for use with an inverted index. Many LSR methods have been introduced recently, with Splade models achieving state-of-the-art performance on MSMarco. Despite similarities in their model architectures, the methods show substantial differences in effectiveness and efficiency, and differences in the experimental setups and configurations used make it difficult to compare the methods and derive insights. In this work, we analyze existing LSR methods and identify key components to establish an LSR framework that unifies all LSR methods under the same perspective. We then reproduce all prominent methods using a common codebase and re-train them in the same environment, which allows us to quantify how components of the framework affect effectiveness and efficiency. We find that (1) including document term weighting is most important for a method's effectiveness, (2) including query weighting has a small positive impact, and (3) document expansion and query expansion have a cancellation effect. As a result, we show how removing query expansion from a state-of-the-art model can significantly reduce latency while maintaining effectiveness on the MSMarco and TripClick benchmarks. Our code is publicly available at https://github.com/thongnt99/learned-sparse-retrieval
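To make the core mechanism concrete, the following is a minimal sketch (not the authors' released implementation) of how LSR retrieval works under the framing above: an LSR encoder maps each query and document to a sparse vector of vocabulary term weights, relevance is the dot product of those vectors, and an inverted index ensures only documents sharing at least one query term are scored. The example term weights and the `retrieve` helper are illustrative assumptions, not the output of any particular model.

```python
# Minimal sketch of sparse lexical retrieval with an inverted index.
# In a real LSR system, the term weights below would be produced by a
# trained encoder (possibly including expansion terms that do not occur
# in the original text); here they are hand-picked for illustration.
from collections import defaultdict

def build_inverted_index(docs: dict[str, dict[str, float]]):
    """Map each term to the (doc_id, weight) postings that contain it."""
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            index[term].append((doc_id, weight))
    return index

def retrieve(index, query_weights: dict[str, float], k: int = 10):
    """Score via sparse dot product, touching only matching postings."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: -item[1])[:k]

# Hypothetical encoder outputs for two documents and one query.
docs = {
    "d1": {"sparse": 1.2, "retrieval": 0.9, "index": 0.4},
    "d2": {"dense": 1.1, "retrieval": 0.7},
}
index = build_inverted_index(docs)
print(retrieve(index, {"sparse": 1.0, "retrieval": 0.5}))
```

In this picture, the framework components discussed above map onto the sketch directly: document/query term weighting determines the weight values, and document/query expansion determines which terms receive nonzero weight at all.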