Tree-based algorithms such as random forests and gradient boosted trees remain among the most popular and powerful machine learning models across many disciplines. The conventional approach to estimating a feature's impact in tree-based models is to measure the \textit{node-wise reduction of a loss function}, which (i) yields only global importance measures and (ii) is known to suffer from severe biases. Conditional feature contributions (CFCs) provide \textit{local}, case-by-case explanations of a prediction by following the decision path and attributing changes in the model's expected output to each feature along the path. However, Lundberg et al. pointed out a potential bias of CFCs that depends on a feature's distance from the root of the tree. The now immensely popular alternative, SHapley Additive exPlanation (SHAP) values, appears to mitigate this bias but is computationally much more expensive. Here we contribute a thorough comparison of the explanations computed by both methods on a set of 164 publicly available classification problems, in order to provide data-driven algorithm recommendations to current researchers. For random forests, we find extremely high similarities and correlations between local and global SHAP values and CFC scores, leading to very similar rankings and interpretations. Analogous conclusions hold for the fidelity of using global feature importance scores as a proxy for the predictive power associated with each feature.
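The path-attribution idea behind CFCs can be illustrated with a minimal sketch for a single regression tree, using scikit-learn's tree internals (the \texttt{treeinterpreter} package implements the same decomposition for forests). The function name below is ours, and the synthetic data is purely illustrative: walking from the root to a leaf, the mean prediction over the samples in the current node changes at each split, and that change is credited to the feature used in the split, so the bias (root mean) plus all contributions telescopes to the final prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic data; any regression task works the same way.
X, y = make_regression(n_samples=200, n_features=4, random_state=0)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

def conditional_feature_contributions(reg, x):
    """CFC decomposition of a single tree's prediction for sample x."""
    t = reg.tree_
    node = 0
    contributions = np.zeros(x.shape[0])
    bias = t.value[0].ravel()[0]  # expected output at the root node
    while t.children_left[node] != -1:  # -1 marks a leaf in sklearn trees
        feat = t.feature[node]
        child = (t.children_left[node] if x[feat] <= t.threshold[node]
                 else t.children_right[node])
        # Credit the change in the node-wise mean prediction to the split feature.
        contributions[feat] += t.value[child].ravel()[0] - t.value[node].ravel()[0]
        node = child
    return bias, contributions

bias, contrib = conditional_feature_contributions(reg, X[0])
# The decomposition telescopes: bias + sum of contributions = prediction.
assert np.isclose(bias + contrib.sum(), reg.predict(X[:1])[0])
```

Because the attribution depends on where along the path a feature appears, splits near the root move the node mean by larger amounts on average, which is the depth-dependent bias discussed above.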