Interpretability methods are developed to understand the working mechanisms of black-box models, which is crucial to their responsible deployment. Fulfilling this goal requires both that the explanations generated by these methods are correct and that people can easily and reliably understand them. While the former has been addressed in prior work, the latter is often overlooked, resulting in informal model understanding derived from a handful of local explanations. In this paper, we introduce explanation summary (ExSum), a mathematical framework for quantifying model understanding, and propose metrics for its quality assessment. On two domains, ExSum highlights various limitations in the current practice, helps develop accurate model understanding, and reveals easily overlooked properties of the model. We also connect understandability to other properties of explanations such as human alignment, robustness, and counterfactual minimality and plausibility.
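To make the rule-and-metric structure concrete, below is a minimal sketch, assuming an ExSum-style rule that pairs an applicability predicate with a behavior predicate over word saliency values, and evaluating two of the quality metrics the framework proposes (coverage and validity; sharpness is omitted). All names, data, and thresholds here are hypothetical illustrations, not the official exsum package API.

import numpy as np

# Hypothetical ExSum-style rule: an applicability predicate h picks out
# which feature instances the rule covers, and a behavior predicate b
# states what their saliency values should satisfy.

def coverage(h, instances):
    """Fraction of all feature instances to which the rule applies."""
    return np.mean([h(x) for x in instances])

def validity(h, b, instances):
    """Among applicable instances, fraction whose saliency obeys the behavior."""
    applicable = [x for x in instances if h(x)]
    if not applicable:
        return float("nan")
    return np.mean([b(x) for x in applicable])

# Toy data: (word, saliency) pairs produced by some local explanation method.
instances = [("not", -0.8), ("good", 0.7), ("the", 0.02),
             ("bad", -0.6), ("not", -0.5)]

# Illustrative rule: negation words receive saliency in [-1.0, -0.2].
h = lambda x: x[0] in {"not", "never", "no"}
b = lambda x: -1.0 <= x[1] <= -0.2

print(coverage(h, instances))     # 0.4 (2 of 5 instances are covered)
print(validity(h, b, instances))  # 1.0 (both covered instances obey the behavior)

A rule with high coverage and validity but a very wide saliency range would be uninformative, which is why a sharpness-like metric is also needed to reward specific behavior predicates.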