Feature importance inference, which promotes new scientific discoveries from complex data sets, has been a long-standing statistical problem. Instead of testing parameters that are interpretable only for specific models, there has been increasing interest in model-agnostic methods, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing approaches often make distributional assumptions, which can be difficult to verify in practice, or require model refitting and data splitting, which are computationally intensive and lead to a loss of statistical power. In this work, we develop a novel, mostly model-agnostic and distribution-free inference framework for feature importance that is computationally efficient and statistically powerful. Our approach is fast because we avoid model refitting by leveraging a form of random observation and feature subsampling called minipatch ensembles; this approach also improves statistical power by avoiding data splitting. Our framework can be applied to tabular data and with any machine learning algorithm, together with minipatch ensembles, for regression and classification tasks. Despite the dependencies induced by using minipatch ensembles, we show that our approach provides asymptotic coverage for the feature importance score of any model under mild assumptions. Finally, the same procedure can also be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both predictions and feature importance. We validate our intervals on a series of synthetic and real data examples, including non-linear settings, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.
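As a minimal illustration of the minipatch idea described above (a hedged sketch, not the paper's actual estimator or its confidence-interval construction), the toy code below fits many small models on random subsets of observations and features, then scores each feature by the LOCO-style increase in out-of-patch prediction error when that feature is absent from the minipatch. The data, minipatch sizes, and base learner (`LinearRegression`) are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy data: only features 0 and 1 carry signal (hypothetical example).
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=n)

n_mp, n_obs, n_feat = 300, 40, 3  # number of minipatches and their sizes (illustrative)
# For each observation i and feature j, collect out-of-patch predictions from
# minipatches whose feature subset includes j ("in") vs. excludes j ("out").
preds_in = [[[] for _ in range(p)] for _ in range(n)]
preds_out = [[[] for _ in range(p)] for _ in range(n)]

for _ in range(n_mp):
    I = rng.choice(n, n_obs, replace=False)   # random observation subset
    F = rng.choice(p, n_feat, replace=False)  # random feature subset
    model = LinearRegression().fit(X[np.ix_(I, F)], y[I])
    oos = np.setdiff1d(np.arange(n), I)       # out-of-patch observations
    yhat = model.predict(X[np.ix_(oos, F)])
    for k, i in enumerate(oos):
        for j in range(p):
            (preds_in if j in F else preds_out)[i][j].append(yhat[k])

# LOCO-style importance: mean increase in absolute prediction error
# when feature j is left out of the minipatch.
importance = np.zeros(p)
for j in range(p):
    diffs = []
    for i in range(n):
        if preds_in[i][j] and preds_out[i][j]:
            err_in = abs(y[i] - np.mean(preds_in[i][j]))
            err_out = abs(y[i] - np.mean(preds_out[i][j]))
            diffs.append(err_out - err_in)
    importance[j] = np.mean(diffs)

print(np.round(importance, 2))  # the signal features (0 and 1) should stand out
```

Because every minipatch is cheap to fit, no model is ever refit after a feature is dropped, and the out-of-patch predictions reuse all of the data rather than a held-out split; these are the two sources of the computational and statistical gains the abstract describes.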