To tackle interpretability in deep learning, we present a novel framework to jointly learn a predictive model and its associated interpretation model. The interpreter provides both local and global interpretability of the predictive model in terms of human-understandable, high-level attribute functions, with minimal loss of accuracy. This is achieved through a dedicated architecture and well-chosen regularization penalties. We seek a small dictionary of high-level attribute functions that take as inputs the outputs of selected hidden layers and whose outputs feed a linear classifier. We impose strong conciseness on the activation of attributes with an entropy-based criterion while enforcing fidelity to both the inputs and outputs of the predictive model. A detailed pipeline to visualize the learnt features is also developed. Moreover, besides generating interpretable models by design, our approach can be specialized to provide post-hoc interpretations for a pre-trained neural network. We validate our approach against several state-of-the-art methods on multiple datasets and demonstrate its efficacy on both kinds of tasks.
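To make the described architecture concrete, the following is a minimal PyTorch sketch of the interpreter side of the framework: a small dictionary of attribute functions applied to a hidden-layer output, a linear classifier on the attribute activations, an entropy-based conciseness penalty, and an output-fidelity term tying the interpreter to the predictor. All names (`Interpreter`, `n_attributes`, the loss weights) and the specific penalty forms are illustrative assumptions, not the authors' released implementation; the input-fidelity term (e.g., a decoder reconstructing inputs from hidden activations) is omitted for brevity.

```python
# A minimal sketch, assuming a PyTorch predictive model whose selected
# hidden-layer activations are exposed. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Interpreter(nn.Module):
    """Maps selected hidden-layer outputs to a small dictionary of
    high-level attribute activations, then to class scores via a
    linear classifier."""
    def __init__(self, hidden_dim: int, n_attributes: int, n_classes: int):
        super().__init__()
        # Dictionary of attribute functions over hidden activations.
        self.attributes = nn.Sequential(
            nn.Linear(hidden_dim, n_attributes),
            nn.ReLU(),
        )
        # Linear read-out on top of the attribute activations.
        self.classifier = nn.Linear(n_attributes, n_classes)

    def forward(self, hidden: torch.Tensor):
        a = self.attributes(hidden)        # attribute activations
        return self.classifier(a), a

def entropy_penalty(a: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy of the normalized attribute activations; minimizing it
    encourages each sample to activate only a few attributes."""
    p = a / (a.sum(dim=1, keepdim=True) + eps)
    return -(p * torch.log(p + eps)).sum(dim=1).mean()

def joint_loss(pred_logits, interp_logits, attributes, targets,
               lambda_fid: float = 1.0, lambda_ent: float = 0.1):
    """Joint objective: task loss for the predictor, output fidelity
    (interpreter matches the predictor's output distribution), and
    the conciseness penalty. Weights are hypothetical."""
    task = F.cross_entropy(pred_logits, targets)
    fidelity = F.kl_div(F.log_softmax(interp_logits, dim=1),
                        F.softmax(pred_logits, dim=1),
                        reduction="batchmean")
    return task + lambda_fid * fidelity + lambda_ent * entropy_penalty(attributes)
```

Under this reading, the predictor and interpreter are optimized jointly on `joint_loss`; dropping the task term and freezing the predictor recovers the post-hoc setting mentioned in the abstract, where the interpreter is fit to an already-trained network.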