Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks; however, they are well known to be computationally challenging because they require training a large number of models. As a result, applying them to large datasets has been considered infeasible. To address this issue, we propose Data-OOB, a new data valuation method for bagging models that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data points by reusing trained weak learners. Specifically, Data-OOB takes less than 2.25 hours on a single CPU processor to evaluate $10^6$ samples with an input dimension of 100. Furthermore, Data-OOB has a solid theoretical interpretation: when two different points are compared, it identifies the same important data point as the infinitesimal jackknife influence function. We conduct comprehensive experiments on 12 classification datasets, each with thousands of samples. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods at identifying mislabeled data and finding sets of helpful (or harmful) data points, highlighting the potential of applying data values in real-world applications.
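The out-of-bag idea above can be illustrated with a minimal sketch: train many weak learners on bootstrap samples, then score each training point by the average correctness of the learners that did *not* see it during training. This is only an illustrative approximation of the described approach; the estimator choice, tree depth, number of learners, and scoring rule here are assumptions, not the paper's exact configuration.

```python
# Sketch of an out-of-bag (OOB) data value: each point's value is the
# average accuracy of the weak learners for which it was out-of-bag.
# Hyperparameters (n_estimators, max_depth) are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def oob_data_values(X, y, n_estimators=200, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(y)
    correct = np.zeros(n)  # sum of 1[prediction == label] over OOB learners
    counts = np.zeros(n)   # number of learners for which the point is OOB
    for b in range(n_estimators):
        boot = rng.integers(0, n, size=n)       # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), boot)  # indices not drawn -> out-of-bag
        tree = DecisionTreeClassifier(max_depth=3, random_state=b)
        tree.fit(X[boot], y[boot])              # train weak learner on the bag
        correct[oob] += (tree.predict(X[oob]) == y[oob])
        counts[oob] += 1
    # Per-point OOB accuracy; guard against points that were never OOB.
    return correct / np.maximum(counts, 1)

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
values = oob_data_values(X, y)
# Points with low OOB values are candidates for mislabeled or harmful data.
```

Because every weak learner is trained only once and then reused to score all of its out-of-bag points, the cost is that of fitting a single bagging ensemble, which is what makes this style of valuation scale to large datasets.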