We consider finding a counterfactual explanation for a classification or regression forest, such as a random forest. This requires solving an optimization problem: find the input instance closest to a given instance for which the forest outputs a desired value. Finding an exact solution has a cost that is exponential in the number of leaves in the forest. We propose a simple but very effective approach: we constrain the optimization to only those input space regions defined by the forest that are populated by actual data points. The problem then reduces to a form of nearest-neighbor search using a certain distance on a certain dataset. This has two advantages. First, the solution can be found very quickly, scaling to large forests and high-dimensional data, and enabling interactive use. Second, the solution found is more likely to be realistic, in that it is guided towards high-density areas of input space.
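The core idea can be sketched in its simplest form: restrict the candidate counterfactuals to actual data points and pick, among those the forest assigns the desired output, the one nearest to the query. This is a minimal illustration under Euclidean distance with scikit-learn, not the paper's full method (which searches over the forest-defined regions populated by data); all names here (`find_counterfactual`, `X`, etc.) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def find_counterfactual(forest, X_data, x, target_class):
    """Return the data point nearest to x (Euclidean) that the forest
    classifies as target_class, or None if no such point exists."""
    preds = forest.predict(X_data)
    candidates = X_data[preds == target_class]
    if len(candidates) == 0:
        return None
    dists = np.linalg.norm(candidates - x, axis=1)
    return candidates[np.argmin(dists)]

# Toy setup: train a small forest on synthetic data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

x = X[0]
target = 1 - y[0]  # ask for the opposite class
x_cf = find_counterfactual(forest, X, x, target)
```

Because candidates are real data points, the returned counterfactual is guaranteed to lie in a populated region of input space, and the search cost is a single pass over the dataset rather than exponential in the number of leaves.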