Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image Pre-training) and Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources (YFCC, LAION, Conceptual Captions, WIT, RedCaps, and Shutterstock) to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall, our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design. Code is available at https://github.com/mlfoundations/clip_quality_not_quantity.