In many machine learning applications, it is important for the user to understand the reasoning behind a classifier's recommendation or prediction. The learned models, however, are often too complicated for a human to understand. Research from the social sciences indicates that humans prefer counterfactual explanations over alternatives. In this paper, we present a general framework for generating counterfactual explanations in the textual domain. Our framework is model-agnostic, representation-agnostic, domain-agnostic, and anytime. We model the task as a search problem in a space where the initial state is the classified text and the goal state is a text in the complementary class. The operators transform a text by replacing parts of it. Our framework includes domain-independent operators, but can also exploit domain-specific knowledge through specialized operators. The search algorithm attempts to find a text from the complementary class with minimal word-level Levenshtein distance from the original classified object.
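The search formulation above can be illustrated with a minimal sketch. The code below is not the paper's implementation: it assumes substitution-only operators drawn from a hypothetical `candidates` dictionary and a toy black-box `classify` function, and it uses plain breadth-first search (so the first flip found uses the fewest substitutions, i.e. minimal word-level Levenshtein distance under these operators). The full framework's operators and search strategy are richer.

```python
from collections import deque


def word_levenshtein(a, b):
    """Word-level Levenshtein distance: minimum number of word
    insertions, deletions, and substitutions turning text a into b."""
    a, b = a.split(), b.split()
    dp = list(range(len(b) + 1))          # dp[j] = dist(a[:i], b[:j])
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # delete wa
                        dp[j - 1] + 1,    # insert wb
                        prev + (wa != wb))  # substitute wa -> wb
            prev = cur
    return dp[-1]


def counterfactual_search(text, classify, candidates):
    """Find a text in the complementary class via BFS over single-word
    substitutions; BFS order guarantees a minimal number of edits.
    `classify` is treated as a black box (model-agnostic)."""
    original = classify(text)
    start = tuple(text.split())
    queue, seen = deque([start]), {start}
    while queue:
        words = queue.popleft()
        joined = " ".join(words)
        if classify(joined) != original:
            return joined                 # goal state: class flipped
        for i, w in enumerate(words):     # apply substitution operators
            for repl in candidates.get(w, ()):
                nxt = words[:i] + (repl,) + words[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return None                           # no counterfactual reachable
```

For example, with a toy sentiment classifier `classify = lambda t: "great" in t` and `candidates = {"great": ["terrible"]}`, the search turns "the movie was great" into "the movie was terrible", a counterfactual at word-level Levenshtein distance 1.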