Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm for modeling the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. Additionally, this formulation generalizes to tasks requiring multiple data sources, such as labeled and unlabeled data used in semi-supervised learning. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs.
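To fix ideas, here is a minimal sketch of what one single-round instance of such an optimal data collection problem could look like; the notation ($n_0$ points already held, $q$ points to collect, per-sample cost $c$, penalty $P$, score curve $V(\cdot)$, target $V^*$) is illustrative rather than necessarily the paper's own:

$$
\min_{q \ge 0} \; c\,q \;+\; P \cdot \mathbb{E}\!\left[\mathbf{1}\{\,V(n_0 + q) < V^{*}\,\}\right]
$$

Since the score curve $V$ is unknown at decision time, the expectation is taken over a learned model of it; with multiple data sources (e.g., labeled and unlabeled pools for semi-supervised learning), $q$ and $c$ become vectors of per-source quantities and costs.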
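As a concrete point of comparison, the scaling-law baseline can be sketched in a few lines: fit a saturating power law to the (dataset size, score) pairs observed so far, then invert it to estimate the size needed to hit the target. Everything below (the functional form, the numbers, and names such as `scaling_law`) is an illustrative assumption, not the paper's exact regression procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical statistics gathered so far: dataset sizes and the
# validation scores of models trained on them (illustrative numbers).
n_obs = np.array([1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)
v_obs = np.array([0.62, 0.68, 0.74, 0.79, 0.83])

def scaling_law(n, v_inf, a, b):
    """Saturating power law: the score approaches v_inf as n grows."""
    return v_inf - a * n ** (-b)

# Fit the three parameters to the observed (size, score) pairs.
params, _ = curve_fit(
    scaling_law, n_obs, v_obs,
    p0=[0.95, 5.0, 0.3],  # rough initial guess
    bounds=([0.0, 0.0, 0.0], [1.0, np.inf, 2.0]),
)
v_inf, a, b = params

# Invert the fitted curve to estimate the dataset size needed to reach
# a target score V* (only meaningful when the target is below v_inf).
v_target = 0.90
if v_target < v_inf:
    n_required = (a / (v_inf - v_target)) ** (1.0 / b)
    print(f"estimated requirement: ~{n_required:,.0f} samples")
else:
    print("target exceeds the fitted ceiling; the extrapolation fails")
```

A single extrapolation of this kind ignores fitting error: underestimating the requirement forces additional collection rounds, which is precisely the risk that the paper's formulation penalizes and that LOC trades off against total collection cost.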