Variable importance measures are the main tools to analyze the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the exact MDA definition varies across the main random forest software packages. In this article, our objective is to rigorously analyze the behavior of the main MDA implementations. To this end, we mathematically formalize the various implemented MDA algorithms, and then establish their limits as the sample size increases. In particular, we break down these limits into three components. The first one is related to Sobol indices, which are well-defined measures of a covariate's contribution to the response variance and are widely used in the sensitivity analysis field. This is not the case for the third term, whose value increases with the dependence among covariates. Thus, we theoretically demonstrate that the MDA does not target the right quantity when covariates are dependent, a fact that has already been noticed experimentally. To address this issue, we define a new importance measure for random forests, the Sobol-MDA, which fixes the flaws of the original MDA. We prove the consistency of the Sobol-MDA and show that it empirically outperforms its competitors on both simulated and real data. An open source implementation in R and C++ is available online.
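To make the permutation idea behind the MDA concrete, the following is a minimal sketch of a generic permutation importance: the increase in mean squared error when one covariate column is shuffled. It is an illustration under stated assumptions, not any of the package-specific MDA definitions analyzed in the article, and not the Sobol-MDA. The names `mse` and `permutation_mda`, the fixed `predict` function standing in for a fitted forest, and the toy data are all hypothetical. Actual random forest software typically works with out-of-bag samples, and the details differ across packages.

```python
import random


def mse(predict, X, y):
    """Mean squared error of `predict` over rows X with targets y."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_mda(predict, X, y, j, n_repeats=10, seed=0):
    """Average increase in MSE after shuffling column j of X.

    Hypothetical helper: a generic permutation importance evaluated on a
    given sample, not the out-of-bag scheme of a specific package.
    """
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    increases = []
    for _ in range(n_repeats):
        # Shuffle column j only; all other columns stay in place.
        column = [row[j] for row in X]
        rng.shuffle(column)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        increases.append(mse(predict, X_perm, y) - baseline)
    return sum(increases) / n_repeats


# Toy data (assumption): two independent Gaussian covariates,
# and a response that depends on the first covariate only.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [2.0 * row[0] for row in X]


def predict(row):
    """Stand-in for a fitted model; it ignores the second covariate."""
    return 2.0 * row[0]


print("importance of X0:", permutation_mda(predict, X, y, j=0))
print("importance of X1:", permutation_mda(predict, X, y, j=1))
```

In this toy setting the covariates are independent, so shuffling a column does not create unrealistic covariate combinations. With dependent covariates, shuffling one column breaks its dependence with the others, which is the situation where the abstract states that the MDA no longer targets the right quantity.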