Explainable machine learning holds great potential for analyzing and understanding learning-based systems. These methods can, however, be manipulated to present unfaithful explanations, giving rise to powerful and stealthy adversaries. In this paper, we demonstrate blinding attacks that can fully disguise an ongoing attack against the machine learning model. Similar to neural backdoors, we modify the model's prediction upon trigger presence, but simultaneously fool the provided explanation. This enables an adversary to hide the presence of the trigger or to point the explanation to entirely different portions of the input, throwing a red herring. We analyze different manifestations of such attacks for different explanation types in the image domain, before we proceed to conduct a red-herring attack against malware classification.