Machine learning models are vulnerable to adversarial attacks. In this paper, we consider the scenario where a model is distributed to many users, among whom a malicious user attempts to attack another user. The malicious user probes its own copy of the model to search for adversarial samples, then presents the samples it finds to the victim's copy in order to replicate the attack. By distributing a different copy of the model to each user, we can mitigate such attacks: adversarial samples found on one copy do not work on another. We propose a flexible parameter rewriting method that directly modifies the model's parameters. This method requires no training and can generate a large number of copies, where each copy induces a different set of adversarial samples. Experimental studies show that our approach significantly mitigates such attacks while retaining high accuracy.
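To make the setting concrete, the sketch below illustrates one simple way to derive per-user model copies by perturbing parameters without retraining. This is a minimal, hypothetical example in PyTorch, not the paper's actual rewriting rule; the function `make_user_copy`, the `scale` parameter, and the choice to perturb only weight matrices are all illustrative assumptions.

```python
# Minimal sketch (not the paper's actual rewriting method): derive a per-user
# copy of a trained model by applying a small, user-seeded perturbation to
# selected weight tensors. Each user ID yields a distinct parameter set, so
# adversarial samples crafted on one copy need not transfer to another.
import copy
import torch

def make_user_copy(base_model: torch.nn.Module, user_id: int, scale: float = 1e-2):
    """Return a copy of `base_model` with user-specific parameter rewrites."""
    model = copy.deepcopy(base_model)              # leave the base model untouched
    gen = torch.Generator().manual_seed(user_id)   # deterministic per user
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.dim() >= 2:                   # perturb weight matrices only, skip biases
                noise = torch.randn(param.shape, generator=gen) * scale * param.abs().mean()
                param.add_(noise)
    return model

# Usage: distribute a distinct copy to each user; no retraining is needed.
# copy_for_alice = make_user_copy(trained_model, user_id=1)
# copy_for_bob   = make_user_copy(trained_model, user_id=2)
```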