In many scenarios, the interpretability of machine learning models is highly desirable yet difficult to achieve. To explain the individual predictions of such models, local model-agnostic approaches have been proposed. However, the process generating the explanations can be, for a user, as mysterious as the prediction to be explained. Furthermore, interpretability methods frequently lack theoretical guarantees, and their behavior even on simple models is often unknown. While it is difficult, if not impossible, to ensure that an explainer behaves as expected on a cutting-edge model, we can at least ensure that everything works on simple, already interpretable models. In this paper, we present a theoretical analysis of Anchors (Ribeiro et al., 2018): a popular rule-based interpretability method that highlights a small set of words to explain a text classifier's decision. After formalizing its algorithm and providing useful insights, we demonstrate mathematically that Anchors produces meaningful results when used with linear text classifiers on top of a TF-IDF vectorization. We believe that our analysis framework can aid in the development of new explainability methods based on solid theoretical foundations.
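To make the analyzed setting concrete, the sketch below shows the kind of pipeline the paper studies: a linear classifier (here a logistic regression) trained on a TF-IDF vectorization, with its prediction on one document explained by Anchors. The toy dataset, hyperparameters, and the use of the `anchor-exp` package together with spaCy are illustrative assumptions for this sketch, not part of the paper itself.

```python
# Minimal sketch (under stated assumptions): Anchors on a linear classifier
# trained over TF-IDF features. Assumes the `anchor-exp` package and the
# spaCy model `en_core_web_sm` are installed; data and labels are toy examples.
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from anchor import anchor_text

docs = [
    "the movie was great",
    "terrible acting and a dull plot",
    "a great cast and a great script",
    "dull, boring, terrible",
]
labels = np.array([1, 0, 1, 0])  # 1 = positive, 0 = negative

# Linear text classifier on top of a TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

def predict(texts):
    # Vectorize raw texts, then apply the linear model (returns class labels)
    return clf.predict(vectorizer.transform(texts))

# Anchors explainer: words not in the anchor are replaced by an UNK token
nlp = spacy.load("en_core_web_sm")
explainer = anchor_text.AnchorText(
    nlp, ["negative", "positive"], use_unk_distribution=True
)
exp = explainer.explain_instance(
    "a great cast and a great script", predict, threshold=0.95
)
print("Anchor:", " AND ".join(exp.names()))
```

In this setting, the paper's analysis explains which words such an anchor tends to contain when the underlying classifier is linear in the TF-IDF features.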