Objective: Shapley additive explanations (SHAP) is a popular post hoc technique for explaining black box models. While the impact of data imbalance on predictive models has been extensively studied, its impact on SHAP-based model explanations remains largely unknown. This study sought to investigate the effects of data imbalance on SHAP explanations for deep learning models and to propose a strategy to mitigate these effects.

Materials and Methods: We propose adjusting the class distributions of the background and explanation data used by SHAP when explaining black box models. Our balancing strategy composes both the background data and the explanation data with an equal number of instances from each class. To evaluate the effects of this adjustment on model explanation, we propose to use the beeswarm plot as a qualitative tool for identifying "abnormal" explanation artifacts, and to quantitatively test the consistency between variable importance and prediction power. We demonstrated the proposed approach in an empirical study that predicted inpatient mortality using the Medical Information Mart for Intensive Care (MIMIC-III) data and a multilayer perceptron.

Results: The data balancing strategy reduced the number of artifacts in the beeswarm plot, mitigating the negative effects of data imbalance. Additionally, the top-ranked variables in the resulting importance ranking demonstrated improved discrimination power.

Discussion and Conclusion: Our findings suggest that balanced background and explanation data can reduce the noise that a skewed class distribution induces in explanation results and improve the reliability of variable importance rankings. These balancing procedures also enhance the potential of SHAP to identify patients with abnormal characteristics in clinical applications.
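The balancing step can be sketched concretely. Below is a minimal, hypothetical Python illustration of composing class-balanced background and explanation sets before computing SHAP values; the synthetic cohort, the scikit-learn MLPClassifier, the choice of shap.KernelExplainer, and all sample sizes are illustrative assumptions standing in for the MIMIC-III data and the study's multilayer perceptron, not the authors' exact configuration.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Imbalanced synthetic stand-in for an inpatient-mortality cohort (~10% positives).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

# Black box model: a simple multilayer perceptron.
model = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=300,
                      random_state=0).fit(X_train, y_train)

def balanced_sample(X, y, n_per_class):
    """Draw n_per_class rows from each class so the subset is 50/50."""
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
        for c in (0, 1)
    ])
    return X[idx], y[idx]

# Background data: the reference distribution SHAP marginalizes features over.
X_bg, _ = balanced_sample(X_train, y_train, n_per_class=50)
# Explanation data: the instances whose predictions are explained.
X_ex, _ = balanced_sample(X_test, y_test, n_per_class=100)

# Model-agnostic explainer over the predicted probability of the positive class.
f = lambda x: model.predict_proba(x)[:, 1]
explainer = shap.KernelExplainer(f, X_bg)
shap_values = explainer.shap_values(X_ex)  # shape: (200, 20)

# Beeswarm plot: the qualitative check for "abnormal" explanation artifacts.
shap.summary_plot(shap_values, X_ex)
```

For an actual deep learning model, shap.DeepExplainer could be substituted for KernelExplainer; the balancing of X_bg and X_ex is unchanged.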
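The quantitative check can likewise be sketched. One plausible reading of "consistency between variable importance and prediction power" is to rank variables by mean absolute SHAP value and verify that models refit on only the top-ranked variables still discriminate well; the refit-and-AUROC protocol below continues from the sketch above and is an assumption, not necessarily the authors' exact procedure.

```python
from sklearn.metrics import roc_auc_score

# Global importance: mean absolute SHAP value per variable, ranked descending.
importance = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(importance)[::-1]

# If importance is consistent with prediction power, AUROC should already be
# high at small k and saturate as lower-ranked variables are added.
for k in (5, 10, 20):
    top_k = ranking[:k]
    clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=300,
                        random_state=0).fit(X_train[:, top_k], y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test[:, top_k])[:, 1])
    print(f"top-{k:2d} variables: AUROC = {auc:.3f}")
```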