Scientists and practitioners increasingly rely on machine learning to model data and draw conclusions. Compared to classical statistical modeling, machine learning makes fewer explicit assumptions about the structure of the data, such as linearity. However, the parameters of machine learning models usually cannot be easily related to the data generating process. To learn about the modeled relationships, partial dependence (PD) plots and permutation feature importance (PFI) are often used as interpretation methods, yet PD and PFI lack a theory that relates them to the data generating process. We formalize PD and PFI as statistical estimators of ground-truth estimands rooted in the data generating process. We show that PD and PFI estimates deviate from this ground truth due to statistical biases, model variance, and Monte Carlo approximation errors. To account for model variance in PD and PFI estimation, we propose the learner-PD and the learner-PFI, which are based on model refits, together with corrected variance and confidence interval estimators.
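To make the quantities discussed above concrete, here is a minimal sketch of Monte Carlo PD and PFI estimation, plus a refit-based learner-PFI that captures model variance across refits. This is an illustrative toy (simulated linear data, a least-squares learner, bootstrap refits, and the helper names `partial_dependence`, `permutation_importance`, and `learner_pfi` are all our own assumptions), not the paper's exact estimators or variance corrections.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data generating process: y depends on x1 only, x2 is noise.
n = 500
X = rng.normal(size=(n, 2))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=n)

def fit_linear(X, y):
    """Least-squares learner; returns a prediction function."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: beta[0] + Z @ beta[1:]

model = fit_linear(X, y)

def partial_dependence(model, X, j, grid):
    """Monte Carlo PD: mean prediction with feature j fixed at each grid value."""
    pd_vals = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v
        pd_vals.append(model(Xv).mean())
    return np.array(pd_vals)

def permutation_importance(model, X, y, j, rng):
    """PFI: increase in MSE when feature j is permuted (one permutation draw)."""
    base_mse = np.mean((y - model(X)) ** 2)
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return np.mean((y - model(Xp)) ** 2) - base_mse

def learner_pfi(fit, X, y, j, n_refits=15, rng=None):
    """Refit-based PFI: refit the learner on bootstrap samples so the
    spread of importances reflects model variance, not just one fitted model.
    (A sketch of the idea; the paper's corrected variance estimator differs.)"""
    rng = rng or np.random.default_rng(0)
    scores = []
    for _ in range(n_refits):
        idx = rng.integers(0, len(X), size=len(X))
        m = fit(X[idx], y[idx])
        scores.append(permutation_importance(m, X, y, j, rng))
    scores = np.array(scores)
    return scores.mean(), scores.std(ddof=1)
```

On this toy data, the PD curve for the informative feature is steep while the noise feature's curve is flat, and `learner_pfi` returns both a mean importance and a naive standard deviation over refits, the quantity the corrected variance estimators are designed to calibrate.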