In recent years, numerous methods have been developed to formally verify the robustness of deep neural networks (DNNs). Although the proposed techniques are effective at providing mathematical guarantees about DNN behavior, it is not clear whether the proofs generated by these methods are human-interpretable. In this paper, we bridge this gap by developing new concepts, algorithms, and representations to generate human-understandable interpretations of the proofs. Leveraging the proposed method, we show that the robustness proofs of standard DNNs rely on spurious input features, while the proofs of DNNs trained to be provably robust filter out even semantically meaningful features. The proofs for DNNs that combine adversarial and provably robust training are the most effective at selectively filtering out spurious features while relying on human-understandable input features.