The interpretability of machine learning models has been an essential area of research for the safe deployment of machine learning systems. One particular approach is to attribute model decisions to high-level concepts that humans can understand. However, such concept-based explainability for Deep Neural Networks (DNNs) has been studied mostly in the image domain. In this paper, we extend TCAV, the concept attribution approach, to tabular learning by showing how concepts can be defined over tabular data. On a synthetic dataset with ground-truth concept explanations and on a real-world dataset, we show that our method generates interpretability results that match human-level intuitions. Building on this, we propose a TCAV-based notion of fairness that quantifies which layer of a DNN has learned representations that lead to biased model predictions. Finally, we empirically demonstrate the relation of TCAV-based fairness to a group fairness notion, Demographic Parity.
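To make the abstract's idea concrete, the following is a minimal sketch of how TCAV could be applied to tabular data: a concept is defined by rows satisfying a feature predicate, a Concept Activation Vector (CAV) is obtained from a linear classifier on layer activations, and the TCAV score is the fraction of examples whose class logit increases along the CAV direction. The helper names (`model`, `get_activations`, `get_logit_gradients`) and the exact concept definition are illustrative assumptions, not the paper's implementation.

```python
# Illustrative TCAV sketch for tabular data (assumed helpers, not the paper's code).
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts, random_acts):
    """Train a linear classifier separating concept vs. random activations;
    the normalized weight vector is the Concept Activation Vector (CAV)."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]
    return cav / np.linalg.norm(cav)

def tcav_score(grads_at_layer, cav):
    """Fraction of examples whose class logit increases along the CAV direction.
    `grads_at_layer` holds d(logit)/d(activation) per example at the chosen layer."""
    directional_derivs = grads_at_layer @ cav
    return float(np.mean(directional_derivs > 0))

# For tabular data, a concept can be defined by rows sharing a feature predicate,
# e.g. "age > 60" or "sex == female" (an assumption about how concepts are framed):
# concept_acts = get_activations(model, layer, X[X["age"] > 60])      # hypothetical helper
# random_acts  = get_activations(model, layer, X.sample(len(concept_acts)))
# grads        = get_logit_gradients(model, layer, X_eval, target_class)
# score        = tcav_score(grads, compute_cav(concept_acts, random_acts))
```

Computing this score per layer for a protected-attribute concept is one way to localize where biased representations emerge; for reference, Demographic Parity requires the positive prediction rate P(ŷ = 1 | A = a) to be equal across groups a of the protected attribute.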