SemDup:通过语义重复在网络规模上进行数据效率高的学习</s> (SemDeDup: Data-efficient learning at web-scale through semantic deduplication) - 专知论文

会员服务 ·

0

Performer · Learning · MoDELS · 可辨认的 · 语义相似度 ·

2023 年 3 月 16 日

SemDeDup: Data-efficient learning at web-scale through semantic deduplication

翻译：SemDup:通过语义重复在网络规模上进行数据效率高的学习

Amro Abbas,Kushal Tirumala,Dániel Simig,Surya Ganguli,Ari S. Morcos

Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method which leverages embeddings from pre-trained models to identify and remove semantic duplicates: data pairs which are semantically similar, but not exactly identical. Removing semantic duplicates preserves performance and speeds up learning. Analyzing a subset of LAION, we show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time. Moreover, performance increases out of distribution. Also, analyzing language models trained on C4, a partially curated dataset, we show that SemDeDup improves over prior approaches while providing efficiency gains. SemDeDup provides an example of how simple ways of leveraging quality embeddings can be used to make models learn faster with less data.

翻译：机器学习的进展在很大程度上是由数据大规模增加所驱动的。但是,大型网络规模的数据集,如LAION, 基本上没有完成, 无法查找准确的重复数据, 可能会留下很多冗余。在这里, 我们引入了 SemDeDup, 这是一种利用预先培训过的模型嵌入来识别和清除语义重复数据的方法: 数据对齐, 它们在语义上是相似的, 但并不完全相同。去除语义重复可以保存性能和加速学习。分析LAION的一个子集, 我们显示SemDeDup 能够以最小的性能损失来移除50%的数据, 有效地将培训时间减半。此外, 性能会从分布中增加。此外, 分析在C4 上培训的语言模型, 一个部分整理的数据集, 我们显示SemDeup 在提供效率收益的同时, 会比先前的方法改进。 SemDeup 提供了一个例子, 如何使用简单的方法来利用质量嵌入来让模型更快地学习。</s>

0

相关内容

Performer

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

专知会员服务

75+阅读 · 2022年6月28日

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

165+阅读 · 2020年3月18日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

直播 | Interpretable and Trustworthy Graph Geometric Deep Learning

直播 | Interpretable and Trustworthy Graph Geometric Deep Learning

图与推荐

2+阅读 · 2022年11月2日

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【论文推荐】最新六篇视觉问答相关论文—深度嵌入学习、句子表征学习、深度特征聚合、3D匹配、细粒度文本摘要

【论文推荐】最新六篇视觉问答相关论文—深度嵌入学习、句子表征学习、深度特征聚合、3D匹配、细粒度文本摘要

专知

12+阅读 · 2018年6月9日

【论文推荐】最新7篇视觉问答（VQA）相关论文—解释、读写记忆网络、逆视觉问答、视觉推理、可解释性、注意力机制、计数

【论文推荐】最新7篇视觉问答（VQA）相关论文—解释、读写记忆网络、逆视觉问答、视觉推理、可解释性、注意力机制、计数

专知

30+阅读 · 2018年3月22日

【推荐】图像分类必读开创性论文汇总

【推荐】图像分类必读开创性论文汇总

机器学习研究会

14+阅读 · 2017年8月15日

亚稳态分子间复合物（MIC）反应特性和微观点火机理的研究

国家自然科学基金

0+阅读 · 2015年12月31日

聚合物强化泡沫复合体系作用机理及驱油动力学模型研究

国家自然科学基金

0+阅读 · 2015年12月31日

Schr？dinger-Poisson方程守恒DDG方法研究

国家自然科学基金

2+阅读 · 2015年12月31日

冲蚀磨损诱导非晶合金涂层表面纳米化及自增强机理研究

国家自然科学基金

0+阅读 · 2014年12月31日

树枝状高分子改性Fe3O4-氧化石墨烯纳米杂化材料的制备及其对水溶液中重金属离子的吸附性能研究

国家自然科学基金

0+阅读 · 2013年12月31日

Cystatin B缺失与Prion疾病自噬作用机制的研究

国家自然科学基金

0+阅读 · 2011年12月31日

形貌可控的磁性纳米粒子负载催化剂的制备及催化的有机合成反应

国家自然科学基金

0+阅读 · 2011年12月31日

原位生成纳米粒子分子沉积复合膜的制备及其摩擦学特性研究

国家自然科学基金

0+阅读 · 2009年12月31日

聚合物与硬颗粒混合体系的动力学模拟

国家自然科学基金

1+阅读 · 2009年12月31日

TR3相互作用新蛋白机理研究

国家自然科学基金

1+阅读 · 2008年12月31日

Understanding Noise-Augmented Training for Randomized Smoothing

Arxiv

0+阅读 · 2023年5月8日

Predicting nuclear masses with product-unit networks

Arxiv

0+阅读 · 2023年5月8日

Data Efficient Training with Imbalanced Label Sample Distribution for Fashion Detection

Arxiv

0+阅读 · 2023年5月7日

On Contrastive Learning of Semantic Similarity forCode to Code Search

Arxiv

0+阅读 · 2023年5月5日

On the nonlinear correlation of ML performance between data subpopulations

Arxiv

0+阅读 · 2023年5月4日

Differentiable Reasoning on Large Knowledge Bases and Natural Language

Arxiv

12+阅读 · 2019年12月17日

RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds

Arxiv

11+阅读 · 2019年11月25日

Infusing Knowledge into the Textual Entailment Task Using Graph Convolutional Networks

Infusing Knowledge into the Textual Entailment Task Using Graph Convolutional Networks

Arxiv

23+阅读 · 2019年11月5日

Learning to Learn and Predict: A Meta-Learning Approach for Multi-Label Classification

Learning to Learn and Predict: A Meta-Learning Approach for Multi-Label Classification

Arxiv

17+阅读 · 2019年9月9日

Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks

Arxiv

14+阅读 · 2019年8月8日

VIP会员

文章信息

相关主题

语义相似度

相关VIP内容

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

专知会员服务

75+阅读 · 2022年6月28日

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

165+阅读 · 2020年3月18日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

操作系统智能体：基于多模态大模型（MLLM）的通用计算设备智能体综述

《美国太空军系统全生命周期建模、仿真与分析效能提升方案》最新84页报告

【博士论文】推进数据高效的深度学习：非参数 Transformer、主动测试与上下文学习

自主人工智能：未来战争是否将是自主化的？

相关资讯

直播 | Interpretable and Trustworthy Graph Geometric Deep Learning

直播 | Interpretable and Trustworthy Graph Geometric Deep Learning

图与推荐

2+阅读 · 2022年11月2日

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【论文推荐】最新六篇视觉问答相关论文—深度嵌入学习、句子表征学习、深度特征聚合、3D匹配、细粒度文本摘要

【论文推荐】最新六篇视觉问答相关论文—深度嵌入学习、句子表征学习、深度特征聚合、3D匹配、细粒度文本摘要

专知

12+阅读 · 2018年6月9日

【论文推荐】最新7篇视觉问答（VQA）相关论文—解释、读写记忆网络、逆视觉问答、视觉推理、可解释性、注意力机制、计数

【论文推荐】最新7篇视觉问答（VQA）相关论文—解释、读写记忆网络、逆视觉问答、视觉推理、可解释性、注意力机制、计数

专知

30+阅读 · 2018年3月22日

【推荐】图像分类必读开创性论文汇总

【推荐】图像分类必读开创性论文汇总

机器学习研究会

14+阅读 · 2017年8月15日

相关论文

Understanding Noise-Augmented Training for Randomized Smoothing

Arxiv

0+阅读 · 2023年5月8日

Predicting nuclear masses with product-unit networks

Arxiv

0+阅读 · 2023年5月8日

Data Efficient Training with Imbalanced Label Sample Distribution for Fashion Detection

Arxiv

0+阅读 · 2023年5月7日

On Contrastive Learning of Semantic Similarity forCode to Code Search

Arxiv

0+阅读 · 2023年5月5日

On the nonlinear correlation of ML performance between data subpopulations

Arxiv

0+阅读 · 2023年5月4日

Differentiable Reasoning on Large Knowledge Bases and Natural Language

Arxiv

12+阅读 · 2019年12月17日

RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds

Arxiv

11+阅读 · 2019年11月25日

Infusing Knowledge into the Textual Entailment Task Using Graph Convolutional Networks

Infusing Knowledge into the Textual Entailment Task Using Graph Convolutional Networks

Arxiv

23+阅读 · 2019年11月5日

Learning to Learn and Predict: A Meta-Learning Approach for Multi-Label Classification

Learning to Learn and Predict: A Meta-Learning Approach for Multi-Label Classification

Arxiv

17+阅读 · 2019年9月9日

Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks

Arxiv

14+阅读 · 2019年8月8日

相关基金

亚稳态分子间复合物（MIC）反应特性和微观点火机理的研究

国家自然科学基金

0+阅读 · 2015年12月31日

聚合物强化泡沫复合体系作用机理及驱油动力学模型研究

国家自然科学基金

0+阅读 · 2015年12月31日

Schr？dinger-Poisson方程守恒DDG方法研究

国家自然科学基金

2+阅读 · 2015年12月31日

冲蚀磨损诱导非晶合金涂层表面纳米化及自增强机理研究

国家自然科学基金

0+阅读 · 2014年12月31日

树枝状高分子改性Fe3O4-氧化石墨烯纳米杂化材料的制备及其对水溶液中重金属离子的吸附性能研究

国家自然科学基金

0+阅读 · 2013年12月31日

Cystatin B缺失与Prion疾病自噬作用机制的研究

国家自然科学基金

0+阅读 · 2011年12月31日

形貌可控的磁性纳米粒子负载催化剂的制备及催化的有机合成反应

国家自然科学基金

0+阅读 · 2011年12月31日

原位生成纳米粒子分子沉积复合膜的制备及其摩擦学特性研究

国家自然科学基金

0+阅读 · 2009年12月31日

聚合物与硬颗粒混合体系的动力学模拟

国家自然科学基金

1+阅读 · 2009年12月31日

TR3相互作用新蛋白机理研究

国家自然科学基金

1+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员