The increasing size and complexity of modern ML systems have improved their predictive capabilities but made their behavior harder to explain. Many techniques for model explanation have been developed in response, but we lack clear criteria for assessing them. In this paper, we cast model explanation as the causal inference problem of estimating the causal effects of real-world concepts on the output behavior of ML models given actual input data. We introduce CEBaB, a new benchmark dataset for assessing concept-based explanation methods in Natural Language Processing (NLP). CEBaB consists of short restaurant reviews paired with human-generated counterfactual reviews in which one aspect (food, noise, ambiance, service) of the dining experience was modified. Original and counterfactual reviews are annotated with multiply-validated sentiment ratings at the aspect level and review level. The rich structure of CEBaB allows us to go beyond input features and study the effects of abstract, real-world concepts on model behavior. We use CEBaB to compare the quality of a range of concept-based explanation methods that cover different assumptions and conceptions of the problem, and we seek to establish natural metrics for comparative assessment of these methods.
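As an illustrative sketch of this causal framing (the notation here is ours, not fixed by the abstract): writing $N$ for the model under explanation, $x$ for an observed review, and $x_{C \leftarrow c'}$ for its human-written counterfactual in which concept $C$ (e.g., food) is changed from value $c$ to $c'$, the effect of that concept-level intervention on $N$'s behavior at $x$ can be estimated from the counterfactual pair as
\[
\widehat{\mathrm{Effect}}_{c \to c'}(x) \;=\; N\!\bigl(x_{C \leftarrow c'}\bigr) \;-\; N\!\bigl(x_{C \leftarrow c}\bigr),
\]
and averaging these pairwise differences over the dataset gives an aggregate, concept-level effect against which the outputs of explanation methods can be compared.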