Post-hoc explanation methods are used with the intent of providing insights about neural networks and are sometimes said to help engender trust in their outputs. However, popular explanation methods have been found to be fragile to minor perturbations of input features or model parameters. Relying on constraint relaxation techniques from non-convex optimization, we develop a method that upper-bounds the largest change an adversary can make to a gradient-based explanation via bounded manipulation of either the input features or model parameters. By propagating a compact input or parameter set as symbolic intervals through the forward and backward computations of the neural network, we can formally certify the robustness of gradient-based explanations. Our bounds are differentiable, hence we can incorporate provable explanation robustness into neural network training. Empirically, our method surpasses the robustness provided by previous heuristic approaches. We find that our training method is the only method able to learn neural networks with certificates of explanation robustness across all six datasets tested.
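To make the certification idea concrete, the following is a minimal sketch (not the paper's implementation) of propagating intervals through the forward and backward pass of a one-hidden-layer ReLU network with a scalar output, bounding how much the input-gradient (saliency) explanation can change inside an ℓ∞ ball of radius eps around an input x. All variable and function names are illustrative.

```python
# Minimal sketch: interval bound propagation through the forward and
# backward computations of a small ReLU network, yielding per-feature
# bounds on the input-gradient explanation under bounded input perturbation.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
W1, b1 = rng.normal(size=(d_hid, d_in)), rng.normal(size=d_hid)
w2 = rng.normal(size=d_hid)  # second layer, scalar output

def interval_matvec(W, lo, hi):
    """Bound W @ v over all v in [lo, hi], using standard interval arithmetic."""
    W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
    return W_pos @ lo + W_neg @ hi, W_pos @ hi + W_neg @ lo

def gradient_bounds(x, eps):
    # Forward pass with intervals: bound the pre-activations z1 = W1 x + b1.
    z_lo, z_hi = interval_matvec(W1, x - eps, x + eps)
    z_lo, z_hi = z_lo + b1, z_hi + b1
    # The ReLU derivative is 0/1; its interval follows from the sign of z1's bounds.
    s_lo = (z_lo > 0).astype(float)
    s_hi = (z_hi > 0).astype(float)
    # Backward pass: grad_x = W1^T (w2 * relu'(z1)). Bound w2 * s elementwise ...
    cand = np.stack([w2 * s_lo, w2 * s_hi])
    g_lo, g_hi = cand.min(axis=0), cand.max(axis=0)
    # ... then propagate that interval through W1^T.
    return interval_matvec(W1.T, g_lo, g_hi)

x = rng.normal(size=d_in)
lo, hi = gradient_bounds(x, eps=0.1)
print("certified saliency bounds per input feature:")
print(np.stack([lo, hi]))
```

Because every step above is built from differentiable operations (matrix products, clipping, min/max), such bounds can in principle be used as a training penalty, which is what allows explanation robustness to be optimized directly.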