Machine learning models are vulnerable to adversarial attacks. In this paper, we consider the scenario where a model is to be distributed to multiple users, among which a malicious user attempts to attack another user. The malicious user probes its own copy of the model to search for adversarial samples and then presents the found samples to the victim's copy in order to replicate the attack. We point out that by distributing different copies of the model to different users, we can mitigate the attack such that adversarial samples found on one copy do not work on another copy. We first observe that training a model with different randomness indeed mitigates such replication to a certain degree. However, this offers no guarantee, and retraining is computationally expensive. We therefore propose a flexible parameter rewriting method that directly modifies the model's parameters. This method requires no additional training and induces different sets of adversarial samples in different copies in a more controllable manner. Experimental studies show that our approach significantly mitigates the attacks while retaining high classification accuracy. From this study, we believe that there are many further directions worth exploring.
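To make the distribution scenario concrete, the following is a minimal sketch of the general idea, not the paper's actual rewriting rule: each user receives a copy of a shared base model whose parameters have been slightly rewritten under a per-user seed, so that adversarial samples crafted on one copy are less likely to transfer to another. The function name `make_user_copy`, the `scale` parameter, and the noise-based perturbation are illustrative assumptions introduced here for clarity.

```python
import copy
import torch

def make_user_copy(base_model: torch.nn.Module, seed: int, scale: float = 1e-2) -> torch.nn.Module:
    """Illustrative sketch only: derive a per-user copy of a shared base model
    by perturbing its parameters with small seeded noise. The concrete
    parameter-rewriting rule proposed in the paper may differ; the point is
    that different users end up with slightly different decision boundaries."""
    torch.manual_seed(seed)
    user_model = copy.deepcopy(base_model)
    with torch.no_grad():
        for p in user_model.parameters():
            p.add_(scale * torch.randn_like(p))  # small per-copy parameter rewrite
    return user_model

# Hypothetical usage: adversarial samples found on user_a's copy would then be
# evaluated on user_b's copy to measure how well the attack replicates.
# user_a = make_user_copy(base_model, seed=0)
# user_b = make_user_copy(base_model, seed=1)
```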