The ability to transfer adversarial attacks from one model (the surrogate) to another model (the victim) has been an issue of concern within the machine learning (ML) community. The ability to successfully evade unseen models suggests an uncomfortably low barrier to implementing attacks in practice. In this work we note that, as currently studied, transfer attack research grants the attacker an unrealistic advantage: the attacker is assumed to have exactly the same training data as the victim. We present the first study of transferring adversarial attacks that focuses on the data available to attacker and victim in imperfect settings, without querying the victim, where the two parties' exact training data or learned classes overlap only to a varying degree. This threat model is relevant to applications in medicine, malware detection, and other domains. Under this new threat model, attack success rate is not correlated with data or class overlap in the way one would expect, and varies across datasets. This makes it difficult for attacker and defender to reason about each other and contributes to the broader study of model robustness and security. We remedy this by developing a masked version of Projected Gradient Descent (PGD) that simulates class disparity, which enables the attacker to reliably estimate a lower bound on their attack's success rate.
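The abstract does not spell out the masked PGD variant, so the following is only a minimal sketch of the general idea, assuming a PyTorch surrogate classifier, an L-infinity threat model, and a boolean `class_mask` over the surrogate's output classes marking the classes assumed to be shared with the victim. The function name `masked_pgd`, the mask representation, and the hyperparameter defaults are illustrative assumptions, not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def masked_pgd(model, x, y, class_mask, eps=8/255, alpha=2/255, steps=10):
    """Sketch of a PGD variant whose loss is restricted to a subset of classes.

    class_mask: bool tensor of shape [num_classes]; True for classes assumed to be
    shared with the victim. Labels in y must belong to the masked-in classes.
    This simulates class disparity between surrogate and victim during the attack.
    """
    # Random start inside the L-infinity ball of radius eps around x.
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    x_adv = torch.clamp(x_adv, 0.0, 1.0).detach()

    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        # Suppress logits of classes assumed absent from the victim by filling
        # them with a large negative value before the cross-entropy.
        masked_logits = logits.masked_fill(~class_mask, -1e9)
        loss = F.cross_entropy(masked_logits, y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Untargeted step: ascend the masked loss, then project back into the ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0).detach()
    return x_adv
```

Evaluating adversarial examples crafted this way against the surrogate itself would, under these assumptions, give the attacker a conservative (lower-bound) estimate of transfer success when the victim does not share all of the surrogate's classes.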