能否通过下游数据推动基于原始文本的自我监督学习? 理论分析 (Can Pretext-Based Self-Supervised Learning Be Boosted by Downstream Data? A Theoretical Analysis) - 专知论文

会员服务 ·

0

Boosting（一种模型训练加速方式） · 样本复杂度 · 学成 · 未标记 · 条件独立的 ·

2022 年 2 月 23 日

Can Pretext-Based Self-Supervised Learning Be Boosted by Downstream Data? A Theoretical Analysis

翻译：能否通过下游数据推动基于原始文本的自我监督学习? 理论分析

Jiaye Teng,Weiran Huang,Haowei He

from arxiv, Accepted by AISTATS 2022

Pretext-based self-supervised learning learns the semantic representation via a handcrafted pretext task over unlabeled data and then uses the learned representation for downstream tasks, which effectively reduces the sample complexity of downstream tasks under Conditional Independence (CI) condition. However, the downstream sample complexity gets much worse if the CI condition does not hold. One interesting question is whether we can make the CI condition hold by using downstream data to refine the unlabeled data to boost self-supervised learning. At first glance, one might think that seeing downstream data in advance would always boost the downstream performance. However, we show that it is not intuitively true and point out that in some cases, it hurts the final performance instead. In particular, we prove both model-free and model-dependent lower bounds of the number of downstream samples used for data refinement. Moreover, we conduct various experiments on both synthetic and real-world datasets to verify our theoretical results.

翻译：以文字为基础的自我监督的学习通过手工制作的借口任务,对未贴标签的数据学习语义表达方式,然后对下游任务使用所学的代言方式,这有效地降低了在有条件独立条件下下游任务的抽样复杂性。然而,如果光学独立条件不起作用,下游样本的复杂性就会大为恶化。一个令人感兴趣的问题是,我们是否能够通过使用下游数据改进未贴标签的数据来维持CI的条件,以促进自我监督的学习。乍一看,人们可能会认为,提前看到下游数据将总是会提高下游的性能。然而,我们表明,这并非直觉真实,并指出,在某些情况下,它会损害最后的性能。特别是,我们证明,在用于改进数据的下游样本数量方面,没有模型,而且模式依赖较低界限。此外,我们还在合成和现实世界数据集进行各种实验,以核实我们的理论结果。

0

相关内容

Boosting（一种模型训练加速方式）

Boosting（一种模型训练加速方式）

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

166+阅读 · 2020年3月18日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

全球首个GNN为主的AI创业公司，募资$18.5 million！

全球首个GNN为主的AI创业公司，募资$18.5 million！

图与推荐

1+阅读 · 2022年4月16日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

ACM TOMM Call for Papers

ACM TOMM Call for Papers

CCF多媒体专委会

2+阅读 · 2022年3月23日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

Hierarchical Imitation - Reinforcement Learning

Hierarchical Imitation - Reinforcement Learning

CreateAMind

19+阅读 · 2018年5月25日

【推荐】SVM实例教程

【推荐】SVM实例教程

机器学习研究会

17+阅读 · 2017年8月26日

基于工业大数据挖掘的复杂产品总完工时间动态预测

国家自然科学基金

4+阅读 · 2015年12月31日

玉米响应干旱胁迫的甲基化调控与分子机制解析

国家自然科学基金

0+阅读 · 2014年12月31日

基于数据增强与匹配的自发地理信息数据质量评价方法研究

国家自然科学基金

1+阅读 · 2013年12月31日

局部半完全有向图的分解及相关问题的研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于数据分布评估和支持向量机方法的分布式数据流挖掘模型和算法研究

国家自然科学基金

1+阅读 · 2012年12月31日

基于分形与数据流挖掘技术的动态数据挖掘方法及其应用研究

国家自然科学基金

1+阅读 · 2012年12月31日

关于图顶点划分的 Thomassen 猜想

国家自然科学基金

0+阅读 · 2011年12月31日

重载齿轮箱复杂工况多源激励下复合故障耦合机理及诊断方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于分布式水文模型的流域尺度土壤湿度遥感数据同化研究

国家自然科学基金

0+阅读 · 2009年12月31日

全纯函数空间上的复合算子理论研究

国家自然科学基金

0+阅读 · 2009年12月31日

The White-Box Adversarial Data Stream Model

Arxiv

0+阅读 · 2022年4月19日

The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training

Arxiv

0+阅读 · 2022年4月18日

On Effectively Learning of Knowledge in Continual Pre-training

Arxiv

0+阅读 · 2022年4月17日

METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals

Arxiv

0+阅读 · 2022年4月16日

An error analysis of generative adversarial networks for learning distributions

Arxiv

0+阅读 · 2022年4月16日

Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation

Arxiv

11+阅读 · 2021年12月16日

MATCH: Metadata-Aware Text Classification in A Large Hierarchy

Arxiv

12+阅读 · 2021年2月15日

Look-into-Object: Self-supervised Structure Modeling for Object Recognition

Look-into-Object: Self-supervised Structure Modeling for Object Recognition

Arxiv

15+阅读 · 2020年3月31日

Meta-Learning to Cluster

Meta-Learning to Cluster

Arxiv

17+阅读 · 2019年10月30日

End-to-End Multi-Task Learning with Attention

Arxiv

19+阅读 · 2018年3月28日

VIP会员

文章信息

相关主题

Boosting（一种模型训练加速方式）

样本复杂度

条件独立的

相关VIP内容

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

166+阅读 · 2020年3月18日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

【NeurIPS2025】语义提示扩散变换器的像素级精确深度估计

俄乌冲突的地缘政治与军事教训（万字长文）

【博士论文】弥合多模态基础模型与世界模型之间的鸿沟

量子增强计算机视觉：超越经典算法

相关资讯

全球首个GNN为主的AI创业公司，募资$18.5 million！

全球首个GNN为主的AI创业公司，募资$18.5 million！

图与推荐

1+阅读 · 2022年4月16日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

ACM TOMM Call for Papers

ACM TOMM Call for Papers

CCF多媒体专委会

2+阅读 · 2022年3月23日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

Hierarchical Imitation - Reinforcement Learning

Hierarchical Imitation - Reinforcement Learning

CreateAMind

19+阅读 · 2018年5月25日

【推荐】SVM实例教程

【推荐】SVM实例教程

机器学习研究会

17+阅读 · 2017年8月26日

相关论文

The White-Box Adversarial Data Stream Model

Arxiv

0+阅读 · 2022年4月19日

The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training

Arxiv

0+阅读 · 2022年4月18日

On Effectively Learning of Knowledge in Continual Pre-training

Arxiv

0+阅读 · 2022年4月17日

METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals

Arxiv

0+阅读 · 2022年4月16日

An error analysis of generative adversarial networks for learning distributions

Arxiv

0+阅读 · 2022年4月16日

Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation

Arxiv

11+阅读 · 2021年12月16日

MATCH: Metadata-Aware Text Classification in A Large Hierarchy

Arxiv

12+阅读 · 2021年2月15日

Look-into-Object: Self-supervised Structure Modeling for Object Recognition

Look-into-Object: Self-supervised Structure Modeling for Object Recognition

Arxiv

15+阅读 · 2020年3月31日

Meta-Learning to Cluster

Meta-Learning to Cluster

Arxiv

17+阅读 · 2019年10月30日

End-to-End Multi-Task Learning with Attention

Arxiv

19+阅读 · 2018年3月28日

相关基金

基于工业大数据挖掘的复杂产品总完工时间动态预测

国家自然科学基金

4+阅读 · 2015年12月31日

玉米响应干旱胁迫的甲基化调控与分子机制解析

国家自然科学基金

0+阅读 · 2014年12月31日

基于数据增强与匹配的自发地理信息数据质量评价方法研究

国家自然科学基金

1+阅读 · 2013年12月31日

局部半完全有向图的分解及相关问题的研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于数据分布评估和支持向量机方法的分布式数据流挖掘模型和算法研究

国家自然科学基金

1+阅读 · 2012年12月31日

基于分形与数据流挖掘技术的动态数据挖掘方法及其应用研究

国家自然科学基金

1+阅读 · 2012年12月31日

关于图顶点划分的 Thomassen 猜想

国家自然科学基金

0+阅读 · 2011年12月31日

重载齿轮箱复杂工况多源激励下复合故障耦合机理及诊断方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于分布式水文模型的流域尺度土壤湿度遥感数据同化研究

国家自然科学基金

0+阅读 · 2009年12月31日

全纯函数空间上的复合算子理论研究

国家自然科学基金

0+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员