Estimating feature importance is a central aspect of explaining data-driven models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data-generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published feature importance measure SAGE (Shapley additive global importance) and introduce sub-SAGE, which for tree-based models can be estimated without resampling. We argue that the uncertainties can be estimated via bootstrapping and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as high-dimensional genomics data.
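The bootstrapped-uncertainty idea can be illustrated with a minimal sketch: refit a tree ensemble on bootstrap resamples and recompute a global importance measure each time, then read off percentile intervals. This is not the sub-SAGE estimator itself; as a hedged stand-in it uses a simple permutation-based importance (drop in R² when a feature is shuffled), and the synthetic data, model settings, and helper `permutation_importance` are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): y depends on features 0 and 1 only.
n, p = 500, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

def permutation_importance(model, X, y):
    """Drop in R^2 when each feature is permuted (a simple stand-in
    for a Shapley-based global importance such as SAGE)."""
    base = model.score(X, y)
    imps = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        imps[j] = base - model.score(Xp, y)
    return imps

# Bootstrap: refit on resampled rows, recompute the importance each time.
B = 20
boot = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[idx], y[idx])
    boot[b] = permutation_importance(model, X, y)

# Percentile intervals quantify the uncertainty in each feature's importance.
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
for j in range(p):
    print(f"feature {j}: [{lo[j]:.3f}, {hi[j]:.3f}]")
```

The informative features (0 and 1) should receive intervals well above zero, while the noise features cluster near zero, which is the kind of separation the uncertainty estimates are meant to expose.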