Machine learning models are used in many sensitive domains where, besides predictive accuracy, their comprehensibility is also important. Interpretability of prediction models is necessary to determine their biases and causes of errors, and is a prerequisite for users' trust. For complex state-of-the-art black-box models, post-hoc model-independent explanation techniques are an established solution. Popular and effective techniques such as IME, LIME, and SHAP perturb the features of an instance to explain individual predictions. Recently, Slack et al. (2020) called their robustness into question by showing that their outcomes can be manipulated due to the poor perturbation sampling they employ. This weakness would allow dieselgate-type cheating: owners of sensitive models could deceive inspection and hide potentially unethical or illegal biases in their predictive models. Such manipulation could undermine public trust in machine learning models and lead to legal restrictions on their use. We show that better sampling in these explanation methods prevents malicious manipulation. The proposed sampling uses data generators that learn the training-set distribution and generate new perturbation instances much more similar to the training set. We show that the improved sampling increases the robustness of LIME and SHAP, while the previously untested IME is already the most robust of all.
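To make the idea of data-aware perturbation sampling concrete, the sketch below shows a minimal LIME-style local surrogate whose perturbations are drawn from a generator fitted to the training data (here a Gaussian mixture) rather than from feature-independent noise. This is not the authors' implementation; the function name, parameters, and the choice of generator are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's code): a LIME-style explanation
# whose perturbation instances are sampled from a generator that has learned
# the training-set distribution, so they stay close to the data manifold.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import Ridge

def explain_instance(predict_fn, X_train, x, n_samples=1000, kernel_width=0.75):
    # Fit a simple data generator to the training set (a stand-in for the
    # stronger generators the paper proposes) and sample perturbations from it.
    generator = GaussianMixture(n_components=10, random_state=0).fit(X_train)
    Z, _ = generator.sample(n_samples)

    # Weight each perturbation by its proximity to the explained instance x.
    d = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(d ** 2) / kernel_width ** 2)

    # Fit a weighted linear surrogate to the black-box predictions; its
    # coefficients serve as feature contributions for the prediction at x.
    surrogate = Ridge(alpha=1.0).fit(Z, predict_fn(Z), sample_weight=weights)
    return surrogate.coef_
```

Because the perturbations resemble genuine training instances, an adversarial model that tries to behave differently on out-of-distribution perturbation points (the attack of Slack et al., 2020) has a much harder time distinguishing explanation queries from real data.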