This paper aims to illustrate the phenomenon of concept emergence in a trained DNN. Specifically, we find that the inference score of a DNN can be disentangled into the effects of a few interactive concepts. These concepts can be understood as causal patterns in a sparse, symbolic causal graph that explains the DNN. The faithfulness of using such a causal graph to explain the DNN is theoretically guaranteed, because we prove that the causal graph can well mimic the DNN's outputs on an exponential number of different masked samples. Moreover, the causal graph can be further simplified and rewritten as an And-Or graph (AOG) without a significant loss of explanation accuracy.
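To make the decomposition concrete, below is a minimal sketch, assuming the effect I(S) of a set S of input variables is defined as the Harsanyi dividend over masked inference scores, where v(T) denotes the DNN's output on the sample with only the variables in T left unmasked; the toy scoring function `v` and the specific interaction values are hypothetical, not taken from the paper. The sketch also checks the faithfulness claim: the effects exactly reconstruct the model's output on all 2^n masked samples.

```python
from itertools import chain, combinations

def subsets(indices):
    """All subsets of a collection of variable indices, as sorted tuples."""
    s = list(indices)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def harsanyi_dividends(v, n):
    """Interaction effect of every subset S of the n input variables:
    I(S) = sum over T subset of S of (-1)^(|S|-|T|) * v(T),
    where v(T) is the model output on the masked sample keeping only T."""
    return {S: sum((-1) ** (len(S) - len(T)) * v(T) for T in subsets(S))
            for S in subsets(range(n))}

# Hypothetical toy "model": v(T) stands in for a DNN's inference score
# on the masked sample x_T with three input variables.
def v(T):
    T = set(T)
    score = 0.0
    if 0 in T:
        score += 1.0        # main effect of variable 0
    if {0, 1} <= T:
        score += 0.5        # AND-type interaction between variables 0 and 1
    if {1, 2} <= T:
        score -= 0.3        # interaction between variables 1 and 2
    return score

n = 3
I = harsanyi_dividends(v, n)

# Faithfulness check: for every one of the 2^n masked samples, the sum of
# the effects of all triggered causal patterns recovers the model output.
for T in subsets(range(n)):
    reconstructed = sum(I[S] for S in subsets(T))
    assert abs(reconstructed - v(T)) < 1e-9

# Only the nonzero effects: the explanation is sparse.
print({S: round(val, 3) for S, val in I.items() if abs(val) > 1e-9})
```

On this toy model only three of the 2^3 subsets carry a nonzero effect, which mirrors the abstract's claim that the inference score is explained by a few interactive concepts rather than by all possible variable combinations.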