Tree ensemble methods such as random forests [Breiman, 2001] are very popular for handling high-dimensional tabular data sets, notably because of their good predictive accuracy. However, when machine learning is used for decision-making, settling for the best predictive procedure may not be reasonable, since informed decisions require an in-depth comprehension of the algorithm's prediction process. Unfortunately, random forests are not intrinsically interpretable, since their predictions result from averaging several hundred decision trees. A classic approach to gaining knowledge about this so-called black-box algorithm is to compute variable importances, which are employed to assess the predictive impact of each input variable. Variable importances are then used to rank or select variables and thus play a major role in data analysis. Nevertheless, there is no justification for using random forest variable importances in this way: we do not even know what these quantities estimate. In this paper, we analyze one of the two well-known random forest variable importances, the Mean Decrease Impurity (MDI). We prove that if input variables are independent and in the absence of interactions, MDI provides a variance decomposition of the output, where the contribution of each variable is clearly identified. We also study models exhibiting dependence between input variables or interactions, for which the variable importance is intrinsically ill-defined. Our analysis shows that there may exist some benefits to using a forest compared to a single tree.
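For concreteness, below is a minimal sketch of how MDI importances are obtained in practice, using scikit-learn, whose `feature_importances_` attribute on tree ensembles is the impurity-based (MDI) measure. The simulated additive model with independent inputs is a hypothetical illustration mirroring the setting in which the variance decomposition holds; it is not an experiment from the paper.

```python
# Minimal sketch: MDI variable importances from a random forest (scikit-learn).
# The data-generating model below is a hypothetical example: independent
# inputs, additive signal, no interactions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.normal(size=(n, p))                                   # independent inputs
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)   # additive, no interactions

forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

# MDI: impurity decrease accumulated over all splits on a variable,
# averaged across the trees of the forest.
for j, imp in enumerate(forest.feature_importances_):
    print(f"X{j}: MDI = {imp:.3f}")
```

In this independent, interaction-free setting, one would expect the MDI values to roughly reflect each variable's share of the explained output variance, with the noise variables near zero.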