MDA for random forests: inconsistency, and a practical solution via the Sobol-MDA


Random forests · Model evaluation · Tools · Statistics · Black box

17 November 2021


Clément Bénard, Sébastien da Veiga, Erwan Scornet

Variable importance measures are the main tools to analyze the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the exact MDA definition varies across the main random forest software. In this article, our objective is to rigorously analyze the behavior of the main MDA implementations. Consequently, we mathematically formalize the various implemented MDA algorithms, and then establish their limits when the sample size increases. In particular, we break down these limits into three components: the first one is related to Sobol indices, which are well-defined measures of a covariate's contribution to the response variance, widely used in the sensitivity analysis field, as opposed to the third term, whose value increases with dependence within covariates. Thus, we theoretically demonstrate that the MDA does not target the right quantity when covariates are dependent, a fact that has already been noticed experimentally. To address this issue, we define a new importance measure for random forests, the Sobol-MDA, which fixes the flaws of the original MDA. We prove the consistency of the Sobol-MDA and show that the Sobol-MDA empirically outperforms its competitors on both simulated and real data. An open-source implementation in R and C++ is available online.
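
The gap between a permutation-based MDA and a Sobol-type quantity under dependent covariates can be illustrated with a toy simulation. This is a minimal sketch and not the authors' R/C++ implementation. Everything in it is an assumption made for illustration: the two-covariate Gaussian linear model, the use of the true regression function as a stand-in for a fitted forest, and the helper names `simulate`, `permutation_mda`, and `total_sobol_numerator`. It compares the increase in squared error after shuffling one covariate with the unnormalized total Sobol index E[Var(m(X) | X2)], which has the closed form 1 - rho^2 in this toy model.

```python
import random


def simulate(n, rho, seed=0):
    """Draw (x1, x2, y) with corr(x1, x2) = rho and y = x1 + x2 + noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        # x2 has unit variance and correlation rho with x1.
        x2 = rho * x1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        y = x1 + x2 + rng.gauss(0.0, 0.5)
        rows.append((x1, x2, y))
    return rows


def model(x1, x2):
    """Assumed stand-in for a fitted forest: the true regression function."""
    return x1 + x2


def mse(rows):
    """Mean squared prediction error of `model` on the rows."""
    return sum((y - model(x1, x2)) ** 2 for x1, x2, y in rows) / len(rows)


def permutation_mda(rows, seed=1):
    """Increase in MSE after shuffling the x1 column (a permutation-style MDA)."""
    rng = random.Random(seed)
    shuffled = [r[0] for r in rows]
    rng.shuffle(shuffled)
    permuted = [(s, x2, y) for s, (_, x2, y) in zip(shuffled, rows)]
    return mse(permuted) - mse(rows)


def total_sobol_numerator(rho):
    """E[Var(m(X) | X2)] for this toy model: Var(X1 | X2) = 1 - rho^2."""
    return 1.0 - rho ** 2


if __name__ == "__main__":
    for rho in (0.0, 0.9):
        rows = simulate(20000, rho)
        print(rho, permutation_mda(rows), total_sobol_numerator(rho))
```

In this toy setting the permutation value stays near 2 whatever the correlation, because shuffling x1 replaces it with an independent copy and the squared difference has expectation 2. The conditional-variance quantity, by contrast, drops from 1 at rho = 0 to 0.19 at rho = 0.9. With independent covariates the two agree up to a factor of 2; with strong dependence they diverge, which is the kind of mismatch the abstract describes.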

