A faithful and interpretable explanation of an AI model's behavior and internal structure is a high-level explanation that is human-intelligible yet consistent with the known, but often opaque, low-level causal details of the model. We argue that the theory of causal abstraction provides the mathematical foundations for model explanations of this kind. In causal abstraction analysis, we use interventions on model-internal states to rigorously assess whether an interpretable high-level causal model is a faithful description of an AI model. Our contributions in this area are: (1) We generalize causal abstraction to cyclic causal structures and typed high-level variables. (2) We show how multi-source interchange interventions can be used to conduct causal abstraction analyses. (3) We define a notion of approximate causal abstraction that allows us to assess the degree to which a high-level causal model is a causal abstraction of a lower-level one. (4) We prove that constructive causal abstraction can be decomposed into three operations we refer to as marginalization, variable-merge, and value-merge. (5) We formalize the XAI methods of LIME, causal effect estimation, causal mediation analysis, iterative nullspace projection, and circuit-based explanations as special cases of causal abstraction analysis.
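To make the central operation concrete, the following is a minimal Python sketch of an interchange intervention and the resulting interchange-intervention accuracy, using a toy task out = (a + b) * c. The task, the function names (low_level_run, high_level_run, and so on), and the alignment of the internal state h1 with the high-level variable S are illustrative assumptions, not constructs taken from the paper.

```python
# A minimal sketch of an interchange intervention on a toy model,
# assuming a hypothetical alignment between low-level state h1 and
# high-level variable S. Task: out = (a + b) * c.
import random

def low_level_run(inputs, patch=None):
    """Toy low-level 'network'. `patch` overwrites the internal state
    h1, which is how the intervention is implemented in this sketch."""
    a, b, c = inputs
    h1 = a + b                      # hypothesized to realize S
    if patch is not None:
        h1 = patch["h1"]            # interchange intervention on h1
    return h1 * c

def low_level_get(inputs, name):
    """Read an internal state off a source run."""
    a, b, c = inputs
    return {"h1": a + b}[name]

def high_level_run(inputs, patch=None):
    """High-level causal model: S = a + b; out = S * c."""
    a, b, c = inputs
    S = a + b
    if patch is not None:
        S = patch["S"]              # aligned intervention on S
    return S * c

def high_level_get(inputs, name):
    a, b, c = inputs
    return {"S": a + b}[name]

# Interchange-intervention accuracy: for sampled (base, source) pairs,
# patch h1 from the source run into the base run of the low-level model,
# apply the aligned patch of S to the high-level model, and count how
# often the two outputs agree. A multi-source interchange intervention
# would patch several aligned states from *different* source inputs.
random.seed(0)
trials, agree = 1000, 0
for _ in range(trials):
    base = tuple(random.randint(0, 9) for _ in range(3))
    source = tuple(random.randint(0, 9) for _ in range(3))
    low_out = low_level_run(base, patch={"h1": low_level_get(source, "h1")})
    high_out = high_level_run(base, patch={"S": high_level_get(source, "S")})
    agree += int(low_out == high_out)
print(f"interchange-intervention accuracy: {agree / trials:.2f}")  # 1.00 here
```

In this toy case the accuracy is 1.0 because the alignment is exact; on a real network it would typically fall below 1.0, and the notion of approximate causal abstraction in contribution (3) is what licenses reading such a score as a degree of abstraction.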