In spite of intense research efforts, deep neural networks remain vulnerable to adversarial examples: inputs crafted to force a network to confidently produce incorrect outputs. Adversarial examples are typically generated by an attack algorithm that optimizes a perturbation added to a benign input, and many such algorithms have been developed. If it were possible to reverse engineer attack algorithms from adversarial examples, the possibility of attribution could deter bad actors. Here we formulate reverse engineering as a supervised learning problem in which the goal is to assign an adversarial example to a class representing the algorithm and parameters used. To our knowledge, it has not previously been shown whether this is even possible. We first test whether we can classify the perturbations added to images by attacks on undefended single-label image classification models. Taking a "fight fire with fire" approach, we leverage the sensitivity of deep neural networks to adversarial examples, training them to classify these perturbations. On a 17-class dataset (5 attacks, 4 of them bounded with 4 epsilon values each), we achieve an accuracy of 99.4% with a ResNet50 model trained on the perturbations. We then ask whether we can perform this task without access to the ground-truth perturbations, instead obtaining an estimate of them with signal processing algorithms, an approach we call "fingerprinting". We find that the JPEG algorithm serves as a simple yet effective fingerprinter (85.05% accuracy), providing a strong baseline for future work. We discuss how our approach can be extended to attack-agnostic, learnable fingerprints, and to open-world scenarios with unknown attacks.
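The "fingerprinting" idea above can be sketched in a few lines: JPEG compression tends to suppress the high-frequency noise that many attacks introduce, so the residual between an image and its JPEG-compressed copy serves as an estimate of the perturbation. A minimal illustration, assuming Pillow and NumPy (the function name and quality setting are illustrative, not from the paper):

```python
import io

import numpy as np
from PIL import Image


def jpeg_fingerprint(image: np.ndarray, quality: int = 75) -> np.ndarray:
    """Estimate an adversarial perturbation as x - JPEG(x).

    `image` is an HxWx3 uint8 array. JPEG compression removes much of the
    high-frequency content, so the residual approximates the perturbation
    pattern left by an attack, without needing the benign original.
    """
    buf = io.BytesIO()
    Image.fromarray(image).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    compressed = np.asarray(Image.open(buf), dtype=np.float32)
    return image.astype(np.float32) - compressed
```

The resulting residual would then be fed to a classifier (e.g. a ResNet50) trained to predict which attack algorithm and parameter setting produced it.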