Classifier-specific (CS) and classifier-agnostic (CA) feature importance methods are widely used (often interchangeably) by prior studies to derive feature importance ranks from a defect classifier. However, different feature importance methods are likely to compute different feature importance ranks even for the same dataset and classifier. Hence, such interchangeable use of feature importance methods can lead to conclusion instabilities unless the different methods agree strongly. Therefore, in this paper, we evaluate the agreement between the feature importance ranks computed by different methods for the studied classifiers, through a case study of 18 software projects and six commonly used classifiers. We find that: 1) The feature importance ranks computed by CA and CS methods do not always strongly agree with each other. 2) The feature importance ranks computed by the studied CA methods exhibit strong agreement, including for the features reported at the top-1 and top-3 ranks for a given dataset and classifier, while even the commonly used CS methods yield vastly different feature importance ranks. Such findings raise concerns about the stability of conclusions across replicated studies. We further observe that the commonly used defect datasets are rife with feature interactions and that these feature interactions impact the feature importance ranks computed by the CS methods (but not the CA methods). We demonstrate that removing these feature interactions, even with simple methods like CFS, improves agreement between the feature importance ranks computed by CA and CS methods. In light of our findings, we provide guidelines for stakeholders and practitioners performing model interpretation, as well as directions for future research; for example, future research is needed to investigate the impact of advanced feature interaction removal methods on the feature importance ranks computed by different CS methods.
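To make the notion of "agreement between feature importance ranks" concrete, the following is a minimal sketch of how two rank lists for the same dataset and classifier might be compared. It is an illustration under stated assumptions, not the procedure used in the study: the abstract does not name the agreement measures, so Kendall's tau and a top-k overlap are used here as plausible choices, and the feature names and ranks are hypothetical values invented for the example.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings given as {feature: rank} dicts.

    Assumes both dicts cover the same features and contain no tied ranks.
    """
    features = sorted(rank_a)
    concordant = discordant = 0
    for f, g in combinations(features, 2):
        # A pair is concordant when both rankings order f and g the same way.
        sign = (rank_a[f] - rank_a[g]) * (rank_b[f] - rank_b[g])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(features) * (len(features) - 1) // 2
    return (concordant - discordant) / n_pairs


def top_k_overlap(rank_a, rank_b, k):
    """Fraction of features that both rankings place within the top k."""
    top_a = {f for f, r in rank_a.items() if r <= k}
    top_b = {f for f, r in rank_b.items() if r <= k}
    return len(top_a & top_b) / k


# Hypothetical ranks (1 = most important) from one CS and one CA method.
cs_ranks = {"loc": 1, "churn": 2, "complexity": 3, "coupling": 4, "age": 5}
ca_ranks = {"loc": 2, "churn": 1, "complexity": 3, "coupling": 5, "age": 4}

tau = kendall_tau(cs_ranks, ca_ranks)
top1 = top_k_overlap(cs_ranks, ca_ranks, 1)
top3 = top_k_overlap(cs_ranks, ca_ranks, 3)
print(f"tau={tau:.2f} top-1 overlap={top1:.2f} top-3 overlap={top3:.2f}")
```

In this toy case the two rankings agree fully on the top-3 set yet disagree on the top-1 feature, which illustrates why agreement can be worth checking at several rank cut-offs rather than with a single summary number.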