Shapley values have established themselves as one of the most appropriate and theoretically sound frameworks for explaining predictions from complex machine learning models. The popularity of Shapley values in the explanation setting is probably due to their unique theoretical properties. The main drawback of Shapley values, however, is that their computational complexity grows exponentially with the number of input features, making them infeasible in many real-world situations where there may be hundreds or thousands of features. Furthermore, with many (dependent) features, presenting/visualizing and interpreting the computed Shapley values also becomes challenging. The present paper introduces groupShapley: a conceptually simple approach for dealing with the aforementioned bottlenecks. The idea is to group the features, for example by type or dependence, and then compute and present Shapley values for these groups instead of for all individual features. Reducing hundreds or thousands of features to half a dozen or so makes precise computation practically feasible and greatly simplifies presentation and knowledge extraction. We prove that under certain conditions, groupShapley is equivalent to summing the feature-wise Shapley values within each feature group. Moreover, we provide a simulation study exemplifying the differences when these conditions are not met. We illustrate the usability of the approach in a real-world car insurance example, where groupShapley is used to provide simple and intuitive explanations.
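For concreteness, a minimal sketch of the idea in standard Shapley notation (the symbols below are ours and not necessarily those used later in the paper): with $M$ features partitioned into $G$ groups $\mathcal{G}_1,\dots,\mathcal{G}_G$ and a contribution function $v(\cdot)$ evaluated on the union of the features in the groups of a coalition, groupShapley treats each group as a single player, so the value attributed to group $\mathcal{G}_k$ is
\[
\phi_{\mathcal{G}_k} \;=\; \sum_{\mathcal{S}\,\subseteq\,\mathcal{T}\setminus\{\mathcal{G}_k\}} \frac{|\mathcal{S}|!\,(G-|\mathcal{S}|-1)!}{G!}\,\Bigl(v\bigl(\mathcal{S}\cup\{\mathcal{G}_k\}\bigr)-v(\mathcal{S})\Bigr),
\qquad \mathcal{T}=\{\mathcal{G}_1,\dots,\mathcal{G}_G\}.
\]
The sum then runs over $2^{G-1}$ coalitions of groups rather than $2^{M-1}$ coalitions of individual features, which is what makes exact computation practical when $G$ is small (say, half a dozen groups).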