Anchors [Ribeiro et al. (2018)] is a post-hoc, rule-based interpretability method. For text data, it explains a decision by highlighting a small set of words (an anchor) such that the model being explained produces similar outputs whenever these words are present in a document. In this paper, we present the first theoretical analysis of Anchors, under the assumption that the search for the best anchor is exhaustive. We leverage this analysis to gain insights into the behavior of Anchors on simple models, including elementary if-then rules and linear classifiers.
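To make the notion of an exhaustive anchor search concrete, the following is a minimal sketch and not the reference Anchors implementation (which relies on a bandit-based beam search rather than full enumeration). It rests on several assumptions that do not come from the text above: words outside the anchor are independently replaced by a placeholder token `UNK` with probability 1/2, the precision threshold is 0.95, the shortest anchor reaching the threshold is returned as a proxy for maximal coverage, and the names `rule_classifier`, `precision`, and `exhaustive_anchor` are hypothetical. The toy model is an elementary if-then rule.

```python
from itertools import combinations, product

UNK = "UNK"  # placeholder token for removed words (an assumption of this sketch)


def rule_classifier(words):
    """Toy if-then rule: predict 1 iff both 'not' and 'bad' are present."""
    return int("not" in words and "bad" in words)


def precision(model, words, anchor):
    """Exact precision of `anchor` (a set of word positions).

    Each non-anchor word is masked independently with probability 1/2,
    so every mask is equally likely and we can enumerate all of them.
    """
    target = model(words)
    free = [i for i in range(len(words)) if i not in anchor]
    agree = 0
    total = 0
    for mask in product([False, True], repeat=len(free)):
        perturbed = list(words)
        for pos, hide in zip(free, mask):
            if hide:
                perturbed[pos] = UNK
        agree += int(model(perturbed) == target)
        total += 1
    return agree / total


def exhaustive_anchor(model, words, threshold=0.95):
    """Return (anchor words, precision) for the shortest valid anchor.

    Sizes are tried in increasing order; among anchors of the same size,
    the one with the highest precision is kept (first one on ties).
    """
    n = len(words)
    for size in range(n + 1):
        best = None
        for anchor in combinations(range(n), size):
            p = precision(model, words, set(anchor))
            if p >= threshold and (best is None or p > best[1]):
                best = (anchor, p)
        if best is not None:
            return [words[i] for i in best[0]], best[1]
    return None  # not reached: the full document always has precision 1


if __name__ == "__main__":
    print(exhaustive_anchor(rule_classifier, "this movie is not bad".split()))
    print(exhaustive_anchor(rule_classifier, "this movie is great".split()))
```

Under these assumptions, the first document is anchored by the two words that trigger the rule, while the second receives an empty anchor: masking words can never make the rule fire, so the prediction is already stable without fixing any word. The enumeration is exponential in document length, which is why this sketch is only suitable for very short examples.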