Model free Shapley values for high dimensional data

A model-agnostic variable importance method can be used with arbitrary prediction functions. Here we present some model-free methods that do not require access to the prediction function. This is useful when that function is proprietary and not available, or just extremely expensive to evaluate. It is also useful when studying residuals from a model. The cohort Shapley (CS) method is model-free but has a cost that is exponential in the dimension of the input space. A supervised on-manifold Shapley method from Frye et al. (2020) is also model-free, but it requires as input a second black box model that has to be trained for the Shapley value problem. We introduce an integrated gradient version of cohort Shapley, called IGCS, with cost $\mathcal{O}(nd)$. We show that, over the vast majority of the relevant unit cube, the IGCS value function is close to a multilinear function for which IGCS matches CS. We use some area under the curve (AUC) measures to quantify the performance of IGCS. On a problem from high energy physics we verify that IGCS has nearly the same AUCs as CS. We also use it on a problem from computational chemistry in 1024 variables. We see there that IGCS attains much higher AUCs than we get from Monte Carlo sampling. The code is publicly available at https://github.com/cohortshapley/cohortintgrad.
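To make the exponential cost of exact CS concrete, here is a minimal sketch in Python. It is not the API of the cohortintgrad package, and the names `cohort_value` and `cohort_shapley` are hypothetical. It assumes the usual cohort Shapley setup, which the abstract does not spell out: the value of a feature subset $S$ for a target subject $t$ is the mean response over the subjects that are similar to $t$ on every feature in $S$, and similarity defaults here to exact equality. Only the observed data `X` and responses `y` are used, so no prediction function is needed.

```python
from itertools import combinations
from math import comb

import numpy as np


def cohort_value(X, y, t, S, similar):
    """Mean response over rows similar to row t on every feature in S (assumed definition)."""
    mask = np.ones(len(y), dtype=bool)
    for j in S:
        mask &= similar(X[:, j], X[t, j])
    # Row t is always similar to itself, so the cohort is never empty.
    return y[mask].mean()


def cohort_shapley(X, y, t, similar=lambda col, v: col == v):
    """Exact Shapley values for subject t by enumerating every feature subset."""
    n, d = X.shape
    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for size in range(d):
            # Shapley weight |S|! (d - |S| - 1)! / d! written with a binomial coefficient.
            weight = 1.0 / (d * comb(d - 1, size))
            for S in combinations(others, size):
                with_j = cohort_value(X, y, t, S + (j,), similar)
                without_j = cohort_value(X, y, t, S, similar)
                phi[j] += weight * (with_j - without_j)
    return phi


# Toy data: y = 2 * x0 + x1 on the four corners of the unit square.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0.0, 1.0, 2.0, 3.0])
phi = cohort_shapley(X, y, t=3)
```

For each of the $d$ features the inner loops visit all $2^{d-1}$ subsets of the remaining features, and each visit scans the $n$ rows. That is workable for a handful of variables and hopeless for the 1024 variables mentioned above, which is the gap an $\mathcal{O}(nd)$ method is meant to close.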
