Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image Pre-training) and Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources (YFCC, LAION, Conceptual Captions, WIT, RedCaps, and Shutterstock) to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall, our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design.
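The abstract mentions the CLIP-based data filtering used to construct LAION. As a rough illustration only (not the paper's own method), the sketch below shows how such filtering is commonly implemented: score each candidate image-text pair by the cosine similarity of a pretrained CLIP model's image and text embeddings, and keep only pairs above a threshold (LAION reports a cutoff of roughly 0.3). The candidate pairs, file names, and exact threshold here are illustrative assumptions.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

# Hypothetical candidate image-caption pairs; in a LAION-style pipeline these
# would come from web-crawled alt-text data.
pairs = [("cat.jpg", "a photo of a cat"), ("chart.png", "quarterly revenue report")]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

THRESHOLD = 0.3  # LAION reports filtering at a CLIP similarity of roughly 0.3

kept = []
with torch.no_grad():
    for image_path, caption in pairs:
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        text = clip.tokenize([caption]).to(device)

        # Embed both modalities and L2-normalize so the dot product is cosine similarity.
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        similarity = (image_feat @ text_feat.T).item()
        if similarity >= THRESHOLD:
            kept.append((image_path, caption))

print(f"kept {len(kept)} of {len(pairs)} candidate pairs")
```

In practice this scoring is run in large batches over billions of candidate pairs; the single-pair loop above is kept only for readability.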