通过数据质量分析为ML进行自动可行性研究:标签噪音案例研究 (Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise) - 专知论文

会员服务 ·

0

贝叶斯误差 · Analysis · 可行 · 估计/估计量 · Learning ·

2022 年 8 月 30 日

Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise

翻译：通过数据质量分析为ML进行自动可行性研究:标签噪音案例研究

Cedric Renggli,Luka Rimanic,Luka Kolar,Wentao Wu,Ce Zhang

In our experience of working with domain experts who are using today's AutoML systems, a common problem we encountered is what we call "unrealistic expectations" -- when users are facing a very challenging task with a noisy data acquisition process, while being expected to achieve startlingly high accuracy with machine learning (ML). Many of these are predestined to fail from the beginning. In traditional software engineering, this problem is addressed via a feasibility study, an indispensable step before developing any software system. In this paper, we present Snoopy, with the goal of supporting data scientists and machine learning engineers performing a systematic and theoretically founded feasibility study before building ML applications. We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER), which stems from data quality issues in datasets used to train or evaluate ML model artifacts. We design a practical Bayes error estimator that is compared against baseline feasibility study candidates on 6 datasets (with additional real and synthetic noise of different levels) in computer vision and natural language processing. Furthermore, by including our systematic feasibility study with additional signals into the iterative label cleaning process, we demonstrate in end-to-end experiments how users are able to save substantial labeling time and monetary efforts.

翻译：在与使用今天的自动ML系统的域专家合作的经验中,我们遇到的一个共同问题是我们所谓的“不现实期望”——当用户面对一个非常艰巨的任务时,数据获取过程过于繁琐,而用户则面临一个非常艰巨的任务,而机器学习(ML)则预期会达到惊人的高精度。其中许多问题注定从一开始就会失败。在传统的软件工程中,这个问题是通过可行性研究来解决的,这是开发任何软件系统之前不可或缺的一步。在本文中,我们介绍Snoopy,目的是支持数据科学家和机器学习工程师在建立ML应用程序之前进行系统、理论基础的可行性研究。我们处理这一问题的方法是估计基本任务不可避免的错误,即所谓的Bayes错误率(BER),这源于用于培训或评价ML模型文物的数据集的数据质量问题。我们设计了一个实用的海湾错误估计器,与计算机视觉和自然语言处理中的6个数据集(以及不同级别的其他真实和合成噪音)的基线可行性研究对象进行比较。此外,我们用系统的可行性研究,将更多的信号纳入迭代标签清理过程,我们如何在最终进行重大的实验。

0

相关内容

贝叶斯误差

贝叶斯误差

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

专知会员服务

75+阅读 · 2022年6月28日

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

2020数据工程师成长路线图

专知会员服务

19+阅读 · 2020年9月6日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

IEEE ICKG 2022: Call for Papers

IEEE ICKG 2022: Call for Papers

机器学习与推荐算法

3+阅读 · 2022年3月30日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

IEEE TII Call For Papers

IEEE TII Call For Papers

CCF多媒体专委会

3+阅读 · 2022年3月24日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium9

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium9

中国图象图形学学会CSIG

0+阅读 · 2021年12月17日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

中国图象图形学学会CSIG

0+阅读 · 2021年11月16日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

SIRT1介导的Resveratrol对糖尿病视网膜病变“代谢记忆”的作用及其机制研究

国家自然科学基金

0+阅读 · 2015年12月31日

TRPM7在神经细胞缺血损伤中的作用及DREAM对TRPM7调控机制研究

国家自然科学基金

0+阅读 · 2015年12月31日

化疗诱导的细胞衰老在神经母细胞瘤复发中的作用及分子机制

国家自然科学基金

0+阅读 · 2014年12月31日

疏肝法对情绪调节不良MCI患者工作记忆影响的机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

蓖麻矮化相关RcDof基因功能分析及调控机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

PPAR β/δ基因在结直肠癌血管生成调控中的作用及分子机理

国家自然科学基金

2+阅读 · 2014年12月31日

MNDA受体NR1功能亚基对精神分裂症海马神经发生的调控及机制

国家自然科学基金

0+阅读 · 2011年12月31日

mTOR激活对吗啡耐受的调控及其分子机制

国家自然科学基金

0+阅读 · 2011年12月31日

Curcumin双向调控HO-1/HO-2协同抑制Aβeme复合物防治AD的分子机制

国家自然科学基金

0+阅读 · 2009年12月31日

表观遗传修饰Wnt/βatenin信号通路在气道重构中的作用及机制研究

国家自然科学基金

0+阅读 · 2009年12月31日

Distributed Estimation and Inference for Semi-parametric Binary Response Models

Arxiv

0+阅读 · 2022年10月15日

VHetNets for AI and AI for VHetNets: An Anomaly Detection Case Study for Ubiquitous IoT

Arxiv

0+阅读 · 2022年10月14日

An Empirical Evaluation of Multivariate Time Series Classification with Input Transformation across Different Dimensions

Arxiv

0+阅读 · 2022年10月14日

Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values

Arxiv

0+阅读 · 2022年10月14日

Collective Intelligence for Deep Learning: A Survey of Recent Developments

Arxiv

22+阅读 · 2021年12月22日

Agile, Antifragile, Artificial-Intelligence-Enabled, Command and Control

Arxiv

51+阅读 · 2021年9月14日

Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone Data

Arxiv

12+阅读 · 2021年2月21日

The Causal Learning of Retail Delinquency

Arxiv

14+阅读 · 2020年12月17日

A Survey on Edge Intelligence

A Survey on Edge Intelligence

Arxiv

52+阅读 · 2020年3月26日

Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI

Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI

Arxiv

77+阅读 · 2019年10月22日

VIP会员

文章信息

相关主题

贝叶斯误差

估计/估计量

相关VIP内容

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

专知会员服务

75+阅读 · 2022年6月28日

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

2020数据工程师成长路线图

专知会员服务

19+阅读 · 2020年9月6日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《小型无人机系统侦测追踪技术：声学、计算机视觉与深度学习融合方案》最新98页

《"牧羊人网格"拦截策略：实现无人机集群可靠拦截的新范式》

光纤无人机：反无人机系统的重大挑战

《作战建模与仿真实证研究》

相关资讯

IEEE ICKG 2022: Call for Papers

IEEE ICKG 2022: Call for Papers

机器学习与推荐算法

3+阅读 · 2022年3月30日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

IEEE TII Call For Papers

IEEE TII Call For Papers

CCF多媒体专委会

3+阅读 · 2022年3月24日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium9

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium9

中国图象图形学学会CSIG

0+阅读 · 2021年12月17日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

中国图象图形学学会CSIG

0+阅读 · 2021年11月16日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

相关论文

Distributed Estimation and Inference for Semi-parametric Binary Response Models

Arxiv

0+阅读 · 2022年10月15日

VHetNets for AI and AI for VHetNets: An Anomaly Detection Case Study for Ubiquitous IoT

Arxiv

0+阅读 · 2022年10月14日

An Empirical Evaluation of Multivariate Time Series Classification with Input Transformation across Different Dimensions

Arxiv

0+阅读 · 2022年10月14日

Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values

Arxiv

0+阅读 · 2022年10月14日

Collective Intelligence for Deep Learning: A Survey of Recent Developments

Arxiv

22+阅读 · 2021年12月22日

Agile, Antifragile, Artificial-Intelligence-Enabled, Command and Control

Arxiv

51+阅读 · 2021年9月14日

Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone Data

Arxiv

12+阅读 · 2021年2月21日

The Causal Learning of Retail Delinquency

Arxiv

14+阅读 · 2020年12月17日

A Survey on Edge Intelligence

A Survey on Edge Intelligence

Arxiv

52+阅读 · 2020年3月26日

Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI

Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI

Arxiv

77+阅读 · 2019年10月22日

相关基金

SIRT1介导的Resveratrol对糖尿病视网膜病变“代谢记忆”的作用及其机制研究

国家自然科学基金

0+阅读 · 2015年12月31日

TRPM7在神经细胞缺血损伤中的作用及DREAM对TRPM7调控机制研究

国家自然科学基金

0+阅读 · 2015年12月31日

化疗诱导的细胞衰老在神经母细胞瘤复发中的作用及分子机制

国家自然科学基金

0+阅读 · 2014年12月31日

疏肝法对情绪调节不良MCI患者工作记忆影响的机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

蓖麻矮化相关RcDof基因功能分析及调控机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

PPAR β/δ基因在结直肠癌血管生成调控中的作用及分子机理

国家自然科学基金

2+阅读 · 2014年12月31日

MNDA受体NR1功能亚基对精神分裂症海马神经发生的调控及机制

国家自然科学基金

0+阅读 · 2011年12月31日

mTOR激活对吗啡耐受的调控及其分子机制

国家自然科学基金

0+阅读 · 2011年12月31日

Curcumin双向调控HO-1/HO-2协同抑制Aβeme复合物防治AD的分子机制

国家自然科学基金

0+阅读 · 2009年12月31日

表观遗传修饰Wnt/βatenin信号通路在气道重构中的作用及机制研究

国家自然科学基金

0+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员