Deep neural networks are susceptible to adversarial inputs, and various methods have been proposed to defend these models against adversarial attacks under different perturbation models. The robustness of a model to adversarial attacks is typically analyzed by first constructing adversarial inputs for the model and then testing the model's performance on those inputs. Most of these attacks require white-box access to the model, need access to the data labels, and finding adversarial inputs can be computationally expensive. We propose a simple scoring method for black-box models that indicates their robustness to adversarial input. We show that adversarially more robust models have a smaller $l_1$-norm of LIME weights and sharper explanations.
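A minimal sketch of how such a score could be computed, assuming tabular inputs and the `lime` package; the function and parameter names (`lime_l1_score`, `num_features`, `num_samples`) are illustrative and not taken from the paper:

```python
# Sketch (not the authors' code): score a black-box classifier by the average
# l1-norm of its LIME explanation weights over a set of inputs. Under the
# paper's claim, a smaller score suggests greater adversarial robustness.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

def lime_l1_score(predict_proba, X_train, X_eval, num_features=10, num_samples=1000):
    """predict_proba: black-box function mapping an (n, d) array to class probabilities."""
    explainer = LimeTabularExplainer(X_train, mode="classification")
    norms = []
    for x in X_eval:
        # Fit a local linear surrogate around x and collect its feature weights.
        exp = explainer.explain_instance(
            x, predict_proba, num_features=num_features, num_samples=num_samples
        )
        weights = np.array([w for _, w in exp.as_list()])
        norms.append(np.abs(weights).sum())  # l1-norm of the LIME weights
    return float(np.mean(norms))
```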