The advent of contextualised language models has brought gains in search effectiveness, not just when applied to re-rank the output of classical weighting models such as BM25, but also when used directly for passage indexing and retrieval, a technique known as dense retrieval. In the existing neural ranking literature, two dense retrieval families have become apparent: single representation, where an entire passage is represented by a single embedding (usually derived from BERT's [CLS] token, as exemplified by the recent ANCE approach), and multiple representations, where each token in a passage is represented by its own embedding (as exemplified by the recent ColBERT approach). These two families have not been directly compared. However, given the likely importance of dense retrieval moving forward, a clear understanding of their advantages and disadvantages is paramount. To this end, this paper contributes a direct study of their comparative effectiveness, noting situations where each method under- or over-performs with respect to the other, and with respect to a BM25 baseline. We observe that, while ANCE is more efficient than ColBERT in terms of response time and memory usage, multiple representations are statistically significantly more effective than single representations in terms of MAP and MRR@10. We also show that multiple representations obtain larger improvements than single representations for queries that are the hardest for BM25, as well as for definitional queries and those with complex information needs.
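To make the contrast between the two families concrete, the following minimal sketch illustrates the two scoring regimes; it is not the authors' implementation, and the function names and tensor shapes are illustrative assumptions. Single-representation models such as ANCE score a query-passage pair as the dot product of two embeddings, while multiple-representation models such as ColBERT match each query token embedding to its most similar passage token embedding (MaxSim) and sum the maxima.

import torch

def single_rep_score(q_cls: torch.Tensor, p_cls: torch.Tensor) -> torch.Tensor:
    """Single-representation (ANCE-style) scoring: one embedding per
    query/passage; relevance is their dot product."""
    # q_cls: (dim,), p_cls: (dim,)
    return q_cls @ p_cls

def multi_rep_score(q_embs: torch.Tensor, p_embs: torch.Tensor) -> torch.Tensor:
    """Multiple-representation (ColBERT-style) scoring: one embedding per
    token; each query token contributes the similarity of its best-matching
    passage token, and these maxima are summed."""
    # q_embs: (num_query_tokens, dim), p_embs: (num_passage_tokens, dim)
    sim = q_embs @ p_embs.T             # (num_query_tokens, num_passage_tokens)
    return sim.max(dim=1).values.sum()  # sum of per-query-token MaxSim values

The sketch also hints at the efficiency trade-off noted above: the single-representation index stores one vector per passage, whereas the multiple-representation index stores one vector per token, increasing memory usage and scoring cost.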