With the increasingly widespread use of deep neural networks in critical decision-making applications, interpretability of these models is becoming imperative. We consider the problem of jointly learning a predictive model and its associated interpretation model. The task of the interpreter is to provide both local and global interpretability of the predictive model in terms of human-understandable, high-level attribute functions, without any loss of accuracy. This is achieved through a dedicated architecture and well-chosen regularization penalties. We seek a small-size dictionary of attribute functions that take as inputs the outputs of selected hidden layers and whose outputs feed a linear classifier. We impose a high level of conciseness by constraining each input to activate only a few attributes, via a real-entropy-based criterion, while enforcing fidelity to both the inputs and the outputs of the predictive model. A major advantage of simultaneous learning is that the predictive neural network benefits from the interpretability constraint as well. We also present a more detailed pipeline, based on both common and novel simple tools, for building an understanding of the learnt features. On two datasets, MNIST and QuickDraw, we show their relevance for both global and local interpretability.
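To make the described architecture concrete, below is a minimal PyTorch-style sketch of the joint predictor/interpreter setup: a dictionary of J attribute functions reads selected hidden layers of the predictor and feeds a linear classifier, trained with output-fidelity, input-fidelity, and entropy-based conciseness penalties. All module names, dimensions, and architectural details here are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch of a jointly trained predictor and interpreter.
# All names (Predictor, Interpreter, J, etc.) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Predictor(nn.Module):
    """Ordinary classifier that exposes its selected hidden activations."""
    def __init__(self, in_dim=784, hidden=128, n_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        return self.out(h2), (h1, h2)  # logits + selected hidden layers

class Interpreter(nn.Module):
    """Small dictionary of J attribute functions over the hidden layers,
    a linear classifier on the attributes, and a decoder for input fidelity."""
    def __init__(self, hidden=128, J=24, n_classes=10, in_dim=784):
        super().__init__()
        self.attributes = nn.Linear(2 * hidden, J)  # attribute dictionary
        self.classifier = nn.Linear(J, n_classes)   # linear readout
        self.decoder = nn.Linear(J, in_dim)         # reconstructs the input

    def forward(self, hiddens):
        z = torch.cat(hiddens, dim=1)
        a = torch.relu(self.attributes(z))          # attribute activations
        return self.classifier(a), self.decoder(a), a

def interpretability_losses(x, pred_logits, int_logits, x_hat, a):
    # Output fidelity: the interpreter must mimic the predictor's outputs.
    out_fid = F.kl_div(F.log_softmax(int_logits, dim=1),
                       F.softmax(pred_logits, dim=1), reduction="batchmean")
    # Input fidelity: attributes must retain enough information about x.
    in_fid = F.mse_loss(x_hat, x)
    # Conciseness: entropy of the normalized attribute activations, pushing
    # each input to activate only a few attributes.
    p = a / (a.sum(dim=1, keepdim=True) + 1e-8)
    conciseness = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    return out_fid, in_fid, conciseness
```

In joint training, one would minimize the predictor's usual cross-entropy plus a weighted sum of these three penalties, which is how the interpretability constraint also shapes the predictive network's own representations.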