Comparing interpretability and explainability for feature selection

A common approach to feature selection is to examine the variable importance scores of a machine learning model, as a way to understand which features are most relevant for making predictions. Given the significance of feature selection, it is crucial that the calculated importance scores reflect reality. Overestimating the importance of irrelevant features can lead to false discoveries, while underestimating the importance of relevant features may lead us to discard them, resulting in poor model performance. Additionally, black-box models like XGBoost provide state-of-the-art predictive performance but cannot be easily understood by humans, so we rely on variable importance scores or explainability methods like SHAP to offer insight into their behavior. In this paper, we investigate the performance of variable importance as a feature selection method across various black-box and interpretable machine learning methods. We compare the ability of CART, Optimal Trees, XGBoost and SHAP to correctly identify the relevant subset of variables across a number of experiments. The results show that regardless of whether we use the native variable importance method or SHAP, XGBoost fails to clearly distinguish between relevant and irrelevant features. On the other hand, the interpretable methods are able to correctly and efficiently identify irrelevant features, and thus offer significantly better performance for feature selection.
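The two failure modes named in the abstract (false discoveries and discarded relevant features) can be made concrete with a small sketch. It assumes importance scores are already available as a name-to-score mapping and that features are selected with a simple threshold; the function names, the threshold rule, and the scores are hypothetical illustrations, not the paper's procedure or results.

```python
def select_by_importance(importances, threshold):
    """Keep features whose importance score is strictly above `threshold`."""
    return {name for name, score in importances.items() if score > threshold}


def selection_errors(selected, relevant, all_features):
    """List irrelevant features that were kept and relevant ones that were dropped."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(all_features) - relevant
    return {
        "false_discoveries": sorted(selected & irrelevant),
        "missed_relevant": sorted(relevant - selected),
    }


# Hypothetical scores for illustration only; these are not results from the paper.
importances = {"x1": 0.45, "x2": 0.30, "x3": 0.15, "noise1": 0.07, "noise2": 0.03}
relevant = {"x1", "x2", "x3"}  # assumed ground truth, known only in a simulation

selected = select_by_importance(importances, threshold=0.05)
errors = selection_errors(selected, relevant, importances)
print(sorted(selected))
print(errors)
```

With these made-up numbers, the irrelevant feature `noise1` clears the threshold and counts as a false discovery. Raising the threshold to 0.2 would instead drop the relevant feature `x3`.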

Related content

Feature selection

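Feature selection, also called feature subset selection (FSS) or attribute selection, is the process of choosing N of the M available features so as to optimize a specific metric of the system. It picks the most effective of the original features in order to reduce the dimensionality of the dataset. It is an important means of improving the performance of learning algorithms and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are the key to training a model.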
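The definition above (choose N of M features so that some metric is optimized) can be sketched as an exhaustive search. This is a minimal illustration that assumes the metric is supplied as a scoring function over a candidate subset; `best_subset`, `toy_score`, and the weights are hypothetical names and values chosen for the example.

```python
from itertools import combinations


def best_subset(features, n, score):
    """Return the size-n subset of `features` with the highest score.

    Exhaustive search: evaluates every one of the C(M, n) subsets,
    so it is only practical when M is small.
    """
    best, best_score = None, float("-inf")
    for subset in combinations(features, n):
        s = score(subset)
        if s > best_score:
            best, best_score = subset, s
    return best, best_score


# Hypothetical criterion: each feature contributes a fixed, additive weight.
weights = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.0}


def toy_score(subset):
    return sum(weights[f] for f in subset)


subset, value = best_subset(list(weights), 2, toy_score)
print(subset, value)
```

In practice the score would typically be a validation metric of a model trained on the candidate subset. Because the number of subsets grows combinatorially, cheaper heuristics such as ranking by variable importance are often used instead.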

Interpretable and Efficient Heterogeneous Graph Convolutional Network

Zhuanzhi Member Service

63+ reads · July 12, 2020

Interpretable Machine Learning (interpretable-ml), 238-page PDF

Zhuanzhi Member Service

208+ reads · February 24, 2020

[Open book] Predictive Models: Explore, Explain, and Debug. Human-Centered Interpretable Machine Learning

Zhuanzhi Member Service

37+ reads · December 26, 2019

[Stanford University] Towards Explainable AI: Significance Tests for Neural Networks, 26-page PDF

Zhuanzhi Member Service

27+ reads · December 19, 2019