Explainable AI has become a popular tool for validating machine learning models. Mismatches between the explained model's decision strategy and the user's domain knowledge (e.g. Clever Hans effects) have also been recognized as a starting point for improving faulty models. However, it is less clear what to do when the user and the explanation agree. In this paper, we demonstrate that acceptance of explanations by the user is not a guarantee that an ML model functions well; in particular, some Clever Hans effects may remain undetected. Such hidden flaws of the model can nevertheless be mitigated, and we demonstrate this by contributing a new method, Explanation-Guided Exposure Minimization (EGEM), which preemptively prunes variations of the ML model that have not received positive explanation feedback. Experiments on natural image data demonstrate that our approach yields models that strongly reduce their reliance on hidden Clever Hans strategies and, as a consequence, achieve higher accuracy on new data.
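The abstract only names the method at a high level. As a rough, hypothetical illustration of the underlying idea (prune model components that explanations the user accepted did not rely on), the sketch below uses a simple gradient-times-input attribution and a weight-shrinking heuristic; the function names, the attribution choice, and the threshold are all assumptions, not the paper's implementation.

```python
# Minimal sketch of explanation-guided exposure minimization (assumed names/heuristics).
import torch
import torch.nn as nn


def input_gradient_relevance(model, x):
    """Attribute relevance to each input feature via gradient x input,
    a simple stand-in for the explanation method used in practice."""
    x = x.clone().requires_grad_(True)
    model(x).sum().backward()
    return (x.grad * x).abs().mean(dim=0)  # average relevance per input feature


def shrink_unexposed_weights(layer, relevance, threshold=0.05):
    """Scale down first-layer weights of features that received little relevance
    in explanations the user accepted (hypothetical pruning heuristic)."""
    keep = relevance / relevance.max()
    scale = torch.where(keep > threshold, torch.ones_like(keep), keep / threshold)
    with torch.no_grad():
        layer.weight.mul_(scale)  # broadcasts over the input dimension
    return layer


# Toy usage: a linear model and a batch of examples whose explanations were accepted.
model = nn.Linear(10, 1)
x_accepted = torch.randn(32, 10)
rel = input_gradient_relevance(model, x_accepted)
shrink_unexposed_weights(model, rel)
```

The intended effect of such a step is that input directions never supported by accepted explanations contribute little to the pruned model, so an undetected Clever Hans feature has less opportunity to drive predictions on new data.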