Machine Learning (ML) models are known to be vulnerable to adversarial inputs, and researchers have demonstrated that even production systems, such as self-driving cars and ML-as-a-service offerings, are susceptible. These systems are targets for bad actors, and their disruption can cause real physical and economic harm. When attacks on production ML systems occur, the ability to attribute the attack to the responsible threat group is a critical step in formulating a response and holding the attackers accountable. We pose the following question: can adversarially perturbed inputs be attributed to the particular methods used to generate the attack? In other words, is there a signal in these attacks that exposes the attack algorithm, model architecture, or hyperparameters used to generate them? We introduce the concept of adversarial attack attribution and create a simple supervised learning experimental framework to examine the feasibility of discovering attributable signals in adversarial attacks. We find that it is possible to differentiate attacks generated with different attack algorithms, models, and hyperparameters on both the CIFAR-10 and MNIST datasets.
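A minimal sketch of the supervised attribution setup described above, under assumptions not specified in the abstract: a small untrained victim network, synthetic 28x28 inputs standing in for MNIST, FGSM and PGD as the two attack classes, and a simple linear attribution classifier. The architectures, epsilon values, and step counts are illustrative placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical victim model the attacks are generated against.
victim = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))

def fgsm(x, y, eps=0.1):
    """One-step FGSM: perturb along the sign of the input gradient."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(victim(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def pgd(x, y, eps=0.1, alpha=0.02, steps=10):
    """Iterative PGD: repeated gradient steps projected back into the eps-ball."""
    x_orig, x_adv = x.clone(), x.clone()
    for _ in range(steps):
        x_adv = x_adv.clone().requires_grad_(True)
        F.cross_entropy(victim(x_adv), y).backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()
        x_adv = torch.min(torch.max(x_adv, x_orig - eps), x_orig + eps).clamp(0, 1).detach()
    return x_adv

# Synthetic stand-in for clean images and labels (replace with MNIST or CIFAR-10).
x_clean = torch.rand(256, 1, 28, 28)
y_clean = torch.randint(0, 10, (256,))

# Attribution dataset: inputs are adversarial examples, labels name the
# attack algorithm that produced them (0 = FGSM, 1 = PGD).
x_attr = torch.cat([fgsm(x_clean, y_clean), pgd(x_clean, y_clean)])
y_attr = torch.cat([torch.zeros(256, dtype=torch.long), torch.ones(256, dtype=torch.long)])

# Simple attribution classifier trained with ordinary supervised learning.
attributor = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(attributor.parameters(), lr=1e-3)
for epoch in range(5):
    perm = torch.randperm(len(x_attr))
    for i in range(0, len(x_attr), 64):
        idx = perm[i:i + 64]
        opt.zero_grad()
        F.cross_entropy(attributor(x_attr[idx]), y_attr[idx]).backward()
        opt.step()

acc = (attributor(x_attr).argmax(1) == y_attr).float().mean()
print(f"training attribution accuracy: {acc:.2f}")
```

The same framing extends to attributing model architecture or hyperparameters: only the attribution labels change, from attack-algorithm identity to the architecture or hyperparameter setting used to craft the perturbation.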