We present a new theoretical framework for making black-box classifiers such as neural networks interpretable, basing our work on clear assumptions and guarantees. In our setting, which is inspired by the Merlin-Arthur protocol from Interactive Proof Systems, two functions cooperate to achieve a classification together: the \emph{prover} selects a small set of features as a certificate and presents it to the \emph{classifier}. Introducing a second, adversarial prover allows us to connect a game-theoretic equilibrium to information-theoretic guarantees on the exchanged features. We define notions of completeness and soundness that enable us to lower-bound the mutual information between features and class. We support our framework with numerical experiments on neural network classifiers, explicitly calculating the mutual information of features with respect to the class, and demonstrate good agreement between theory and practice.
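To make the information-theoretic quantity in the abstract concrete, the following is a minimal sketch of how the mutual information $I(F;C)$ between a selected feature and the class label can be estimated empirically from samples. This is a generic plug-in estimator over the empirical joint distribution, not the paper's actual procedure; the function name and the toy data are illustrative assumptions.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(F; C) in bits from observed (feature, class) pairs.

    Generic illustration only -- the paper's estimation procedure may differ.
    """
    n = len(pairs)
    joint = Counter(pairs)                      # empirical joint p(f, c)
    pf = Counter(f for f, _ in pairs)           # marginal counts of features
    pc = Counter(c for _, c in pairs)           # marginal counts of classes
    mi = 0.0
    for (f, c), k in joint.items():
        p_fc = k / n
        # p(f,c) * log2( p(f,c) / (p(f) p(c)) ), with counts cancelling the 1/n factors
        mi += p_fc * math.log2(p_fc * n * n / (pf[f] * pc[c]))
    return mi

# A perfectly informative feature on a balanced binary class carries 1 bit.
samples = [("a", 0)] * 50 + [("b", 1)] * 50
print(mutual_information(samples))  # → 1.0
```

An uninformative feature (independent of the class) would instead yield an estimate near zero, which is the regime the soundness condition is meant to rule out for certificates accepted by the classifier.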