常量通行证检索语言模型 (Retrieval Oriented Masking Pre-training Language Model for Dense Passage Retrieval) - 专知论文

会员服务 ·

0

语言模型化 · 掩码 · INFORMS · PTM · MoDELS ·

2022 年 10 月 27 日

Retrieval Oriented Masking Pre-training Language Model for Dense Passage Retrieval

翻译：常量通行证检索语言模型

Dingkun Long,Yanzhao Zhang,Guangwei Xu,Pengjun Xie

from arxiv, Search LM part of the "AliceMind SLM + HLAR" method in MS MARCO Passage Ranking Leaderboard Submission

Pre-trained language model (PTM) has been shown to yield powerful text representations for dense passage retrieval task. The Masked Language Modeling (MLM) is a major sub-task of the pre-training process. However, we found that the conventional random masking strategy tend to select a large number of tokens that have limited effect on the passage retrieval task (e,g. stop-words and punctuation). By noticing the term importance weight can provide valuable information for passage retrieval, we hereby propose alternative retrieval oriented masking (dubbed as ROM) strategy where more important tokens will have a higher probability of being masked out, to capture this straightforward yet essential information to facilitate the language model pre-training process. Notably, the proposed new token masking method will not change the architecture and learning objective of original PTM. Our experiments verify that the proposed ROM enables term importance information to help language model pre-training thus achieving better performance on multiple passage retrieval benchmarks.

翻译：培训前语言模型(PTM)被证明能为密集通道检索任务产生强大的文字表示力。蒙面语言模型(MLM)是培训前进程的一个主要子任务。然而,我们发现常规随机掩码战略倾向于选择对通道检索任务影响有限的大量象征性物(例如中继词和标点),通过注意到该术语的重要性重量可以为通道检索提供宝贵的信息,我们在此建议采用其他以检索为导向的掩码(作为ROM)战略,其中更重要的符号更有可能被遮盖出来,以捕捉便利语言模型培训前进程这一直截了当但必不可少的信息。值得注意的是,拟议的新的代号掩码方法不会改变原始PTM的架构和学习目标。我们的实验证实,拟议的ROM使术语重要信息能够帮助语言模型的预培训,从而在多个通道检索基准上取得更好的业绩。

0

相关内容

语言模型化

语言模型化

NeurlPS 2022 | 自然语言处理相关论文分类整理

NeurlPS 2022 | 自然语言处理相关论文分类整理

专知会员服务

51+阅读 · 2022年10月2日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

【伯克利】自回归模型的局部掩卷积，Locally Masked Convolution for Autoregressive Models

【伯克利】自回归模型的局部掩卷积，Locally Masked Convolution for Autoregressive Models

专知会员服务

20+阅读 · 2020年6月23日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

165+阅读 · 2020年3月18日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

181+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

IEEE ICKG 2022: Call for Papers

IEEE ICKG 2022: Call for Papers

机器学习与推荐算法

3+阅读 · 2022年3月30日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

IEEE TII Call For Papers

IEEE TII Call For Papers

CCF多媒体专委会

3+阅读 · 2022年3月24日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【论文】图上的表示学习综述

【论文】图上的表示学习综述

机器学习研究会

15+阅读 · 2017年9月24日

两类带导数的非线性Schrodinger方程拟周期解的存在性

国家自然科学基金

0+阅读 · 2015年12月31日

连续体结构动力响应下的拓扑优化方法研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于高光谱数据的交叉定标光谱特性差异订正

国家自然科学基金

0+阅读 · 2013年12月31日

Reality-based Interaction用户界面模型和评估方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

家蚕传染性软化病病毒翻译机制及侵染机制的研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于可见光引入金属-金属间电荷转移的新型光催化剂设计与研究

国家自然科学基金

0+阅读 · 2009年12月31日

几何计算与表示中的约束优化方法研究

国家自然科学基金

2+阅读 · 2009年12月31日

胰岛素抵抗在阿尔茨海默病发病机制中重要作用的研究

国家自然科学基金

0+阅读 · 2009年12月31日

基于本体的Deep Web搜索技术

国家自然科学基金

2+阅读 · 2009年12月31日

面向查询的XML文本自动文摘研究

国家自然科学基金

0+阅读 · 2008年12月31日

MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers

Arxiv

0+阅读 · 2022年12月15日

Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders

Arxiv

0+阅读 · 2022年12月13日

In Defense of Cross-Encoders for Zero-Shot Retrieval

In Defense of Cross-Encoders for Zero-Shot Retrieval

Arxiv

0+阅读 · 2022年12月12日

LEAD: Liberal Feature-based Distillation for Dense Retrieval

Arxiv

0+阅读 · 2022年12月10日

Pre-training Methods in Information Retrieval

Arxiv

16+阅读 · 2021年11月27日

PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval

Arxiv

11+阅读 · 2020年10月20日

Pre-training Text Representations as Meta Learning

Arxiv

13+阅读 · 2020年4月12日

UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training

Arxiv

15+阅读 · 2020年2月28日

Pre-Training with Whole Word Masking for Chinese BERT

Arxiv

11+阅读 · 2019年6月19日

Attention-based Ensemble for Deep Metric Learning

Arxiv

17+阅读 · 2018年4月2日

VIP会员

文章信息

相关主题

语言模型化

相关VIP内容

NeurlPS 2022 | 自然语言处理相关论文分类整理

NeurlPS 2022 | 自然语言处理相关论文分类整理

专知会员服务

51+阅读 · 2022年10月2日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

【伯克利】自回归模型的局部掩卷积，Locally Masked Convolution for Autoregressive Models

【伯克利】自回归模型的局部掩卷积，Locally Masked Convolution for Autoregressive Models

专知会员服务

20+阅读 · 2020年6月23日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

165+阅读 · 2020年3月18日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

181+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

人工智能治理的未来

模态感知的特征匹配：单一模态与跨模态技术的全面综述

无监督行人重识别研究综述

【牛津博士论文】面向神经影像应用的可扩展且可解释的空间模型

相关资讯

IEEE ICKG 2022: Call for Papers

IEEE ICKG 2022: Call for Papers

机器学习与推荐算法

3+阅读 · 2022年3月30日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

IEEE TII Call For Papers

IEEE TII Call For Papers

CCF多媒体专委会

3+阅读 · 2022年3月24日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【论文】图上的表示学习综述

【论文】图上的表示学习综述

机器学习研究会

15+阅读 · 2017年9月24日

相关论文

MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers

Arxiv

0+阅读 · 2022年12月15日

Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders

Arxiv

0+阅读 · 2022年12月13日

In Defense of Cross-Encoders for Zero-Shot Retrieval

In Defense of Cross-Encoders for Zero-Shot Retrieval

Arxiv

0+阅读 · 2022年12月12日

LEAD: Liberal Feature-based Distillation for Dense Retrieval

Arxiv

0+阅读 · 2022年12月10日

Pre-training Methods in Information Retrieval

Arxiv

16+阅读 · 2021年11月27日

PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval

Arxiv

11+阅读 · 2020年10月20日

Pre-training Text Representations as Meta Learning

Arxiv

13+阅读 · 2020年4月12日

UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training

Arxiv

15+阅读 · 2020年2月28日

Pre-Training with Whole Word Masking for Chinese BERT

Arxiv

11+阅读 · 2019年6月19日

Attention-based Ensemble for Deep Metric Learning

Arxiv

17+阅读 · 2018年4月2日

相关基金

两类带导数的非线性Schrodinger方程拟周期解的存在性

国家自然科学基金

0+阅读 · 2015年12月31日

连续体结构动力响应下的拓扑优化方法研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于高光谱数据的交叉定标光谱特性差异订正

国家自然科学基金

0+阅读 · 2013年12月31日

Reality-based Interaction用户界面模型和评估方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

家蚕传染性软化病病毒翻译机制及侵染机制的研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于可见光引入金属-金属间电荷转移的新型光催化剂设计与研究

国家自然科学基金

0+阅读 · 2009年12月31日

几何计算与表示中的约束优化方法研究

国家自然科学基金

2+阅读 · 2009年12月31日

胰岛素抵抗在阿尔茨海默病发病机制中重要作用的研究

国家自然科学基金

0+阅读 · 2009年12月31日

基于本体的Deep Web搜索技术

国家自然科学基金

2+阅读 · 2009年12月31日

面向查询的XML文本自动文摘研究

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员