Adversarial attacks insert small, imperceptible perturbations into input samples that cause large, undesired changes in the output of deep learning models. Despite extensive research on generating adversarial attacks and building defense systems, there has been limited research on understanding adversarial attacks from an input-data perspective. This work introduces the notion of sample attackability, where we aim to identify the samples most susceptible to adversarial attacks (attackable samples) and, conversely, the least susceptible samples (robust samples). We propose a deep-learning-based method to detect the adversarially attackable and robust samples in an unseen dataset for an unseen target model. Experiments on standard image classification datasets enable us to assess the portability of the deep attackability detector across a range of architectures. We find that the deep attackability detector outperforms simple model-uncertainty-based measures at identifying attackable/robust samples. This suggests that uncertainty is an inadequate proxy for a sample's distance to the decision boundary. Beyond improving our understanding of adversarial attack theory, the ability to identify adversarially attackable and robust samples has implications for improving the efficiency of sample-selection tasks, e.g. active learning or data augmentation for adversarial training.
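To make the uncertainty baseline concrete: a minimal sketch (not the paper's method) of the kind of simple model-uncertainty measure the deep detector is compared against, here taken to be predictive entropy of the softmax output. Under this proxy, high-entropy samples are assumed to lie closer to a decision boundary and are therefore ranked as more attackable; the softmax values below are hypothetical.

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy of softmax outputs, a simple model-uncertainty score.

    Higher entropy is treated (under this proxy) as indicating a sample
    closer to the decision boundary, i.e. more 'attackable'.
    """
    probs = np.clip(probs, 1e-12, 1.0)  # guard against log(0)
    return -np.sum(probs * np.log(probs), axis=-1)

# Toy softmax outputs for three samples (hypothetical values).
probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction -> low entropy ('robust')
    [0.40, 0.35, 0.25],  # uncertain prediction -> high entropy ('attackable')
    [0.70, 0.20, 0.10],  # intermediate case
])

scores = predictive_entropy(probs)
ranking = np.argsort(-scores)  # most 'attackable' first, under the proxy
```

The abstract's finding is precisely that a ranking like `ranking` above, derived from uncertainty alone, identifies attackable/robust samples less reliably than a detector trained directly on attackability labels.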