A model-agnostic variable importance method can be used with arbitrary prediction functions. Here we present some model-free methods that do not require access to the prediction function. This is useful when that function is proprietary and not available, or just extremely expensive to evaluate. It is also useful when studying residuals from a model. The cohort Shapley (CS) method is model-free but has cost exponential in the dimension of the input space. A supervised on-manifold Shapley method from Frye et al. (2020) is also model-free but requires as input a second black box model that has to be trained for the Shapley value problem. We introduce an integrated gradient (IG) version of cohort Shapley, called IGCS, with cost $\mathcal{O}(nd)$. We show that over the vast majority of the relevant unit cube, the IGCS value function is close to a multilinear function for which IGCS matches CS. Another benefit of IGCS is that it allows IG methods to be used with binary predictors. We use some area between curves (ABC) measures to quantify the performance of IGCS. On a problem from high energy physics we verify that IGCS has nearly the same ABCs as CS does. We also use it on a problem from computational chemistry in 1024 variables. We see there that IGCS attains much higher ABCs than we get from Monte Carlo sampling. The code is publicly available at https://github.com/cohortshapley/cohortintgrad
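To make the exponential cost of exact cohort Shapley concrete, the following is a minimal sketch, not the code from the repository above. It assumes an exact-match similarity rule (two subjects are similar on a variable when their values are equal), and the names `cohort_value` and `cohort_shapley` are hypothetical. The value of a variable subset is the mean response over the cohort of subjects that match the target subject on every variable in that subset, and the Shapley value is computed by enumerating all subsets.

```python
from itertools import combinations
from math import comb

import numpy as np


def cohort_value(X, y, t, subset):
    """Mean response over subjects matching subject t on every variable in `subset`."""
    mask = np.ones(len(y), dtype=bool)
    for j in subset:
        # Exact-match similarity is an assumption made for this sketch.
        mask &= X[:, j] == X[t, j]
    # Subject t always matches itself, so the cohort is never empty.
    return y[mask].mean()


def cohort_shapley(X, y, t):
    """Exact Shapley values for subject t by enumerating all variable subsets."""
    d = X.shape[1]
    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for size in range(d):
            # Standard Shapley weight for a subset of this size.
            weight = 1.0 / (d * comb(d - 1, size))
            for u in combinations(others, size):
                gain = cohort_value(X, y, t, u + (j,)) - cohort_value(X, y, t, u)
                phi[j] += weight * gain
    return phi


# Toy data: y = 2 * x0 + x1 on a full 2 x 2 design; target subject is row 3.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0.0, 1.0, 2.0, 3.0])
phi = cohort_shapley(X, y, t=3)
```

Only the data `X`, `y` enter the computation; no prediction function is called, which is the model-free property. The nested loops visit every subset of the remaining variables for each of the $d$ variables, so the number of cohort means grows like $2^d$, which is the cost that the $\mathcal{O}(nd)$ IGCS method is meant to avoid.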