Contrastive explanations for understanding the behavior of black-box models have gained a lot of attention recently because they offer the potential for recourse. In this paper, we propose Contrastive Attributed explanations for Text (CAT), a method that provides contrastive explanations for natural language text data with a novel twist: we build and exploit attribute classifiers, leading to more semantically meaningful explanations. To ensure that our generated contrastive text requires the fewest possible edits with respect to the original text, while remaining fluent and close to a human-generated contrastive, we use a minimal-perturbation approach regularized by a BERT language model and by attribute classifiers trained on available attributes. We show through qualitative examples and a user study that our method not only conveys more insight because of these attributes, but also leads to better quality (contrastive) text. Quantitatively, we show that our method outperforms other state-of-the-art methods across four data sets on four benchmark metrics.
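The minimal-perturbation approach described above can be sketched as a constrained objective. The exact formulation used by CAT may differ; the symbols, loss names, and \(\lambda\) weights below are illustrative, not taken from the paper:

\[
\min_{x'} \; d(x, x') \;+\; \lambda_{\mathrm{LM}} \, \mathcal{L}_{\mathrm{LM}}(x') \;+\; \lambda_{\mathrm{attr}} \, \mathcal{L}_{\mathrm{attr}}(x')
\quad \text{s.t.} \quad f(x') \neq f(x),
\]

where \(d(x, x')\) penalizes edits relative to the original text \(x\), \(\mathcal{L}_{\mathrm{LM}}\) is a fluency penalty from the BERT language model, \(\mathcal{L}_{\mathrm{attr}}\) regularizes the edit toward semantically meaningful attribute changes via the attribute classifiers, and \(f\) is the black-box model whose prediction the contrastive text \(x'\) must flip.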