Delusive attacks aim to substantially deteriorate the test accuracy of the learned model by slightly perturbing the features of correctly labeled training examples. By formalizing this malicious attack as finding the worst-case training data within a specific $\infty$-Wasserstein ball, we show that minimizing adversarial risk on the perturbed data is equivalent to minimizing an upper bound of natural risk on the original data. This implies that adversarial training can serve as a principled defense against delusive attacks, and thus the test accuracy degraded by delusive attacks can be largely recovered by adversarial training. To further understand the internal mechanism of the defense, we show that adversarial training resists delusive perturbations by preventing the learner from overly relying on non-robust features in a natural setting. Finally, we complement our theoretical findings with a set of experiments on popular benchmark datasets, showing that the defense withstands six different practical attacks. Both theoretical and empirical results vote for adversarial training when confronted with delusive adversaries.
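To make the central claim concrete, here is a minimal sketch of the kind of risk bound the abstract refers to, in notation introduced only for illustration (the symbols $\ell$, $f$, $\mathcal{D}$, $\hat{\mathcal{D}}$, $\epsilon$, $\mathcal{R}_{\mathrm{nat}}$, and $\mathcal{R}_{\mathrm{adv}}$ are our own), and under the simplifying assumption that the $\infty$-Wasserstein metric and the adversarial perturbation set use the same norm and radius $\epsilon$. Define the natural risk on the clean data and the adversarial risk on the perturbed data as
$$
\mathcal{R}_{\mathrm{nat}}(f;\mathcal{D}) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\!\left[\ell\big(f(x),y\big)\right],
\qquad
\mathcal{R}_{\mathrm{adv}}(f;\hat{\mathcal{D}}) \;=\; \mathbb{E}_{(x,y)\sim\hat{\mathcal{D}}}\!\left[\max_{\|\delta\|\le\epsilon}\ell\big(f(x+\delta),y\big)\right].
$$
If the delusive adversary is constrained to $W_\infty(\mathcal{D},\hat{\mathcal{D}})\le\epsilon$, i.e., each training feature is moved by at most $\epsilon$ while its label stays correct, then every clean example lies within the $\epsilon$-ball of its perturbed counterpart, so $\ell\big(f(x),y\big)\le\max_{\|\delta\|\le\epsilon}\ell\big(f(\hat{x}+\delta),y\big)$ pointwise and, taking expectations,
$$
\mathcal{R}_{\mathrm{nat}}(f;\mathcal{D}) \;\le\; \mathcal{R}_{\mathrm{adv}}(f;\hat{\mathcal{D}}).
$$
Hence minimizing the adversarial risk on the poisoned data $\hat{\mathcal{D}}$ (adversarial training with a matching budget) minimizes an upper bound on the natural risk on the clean data $\mathcal{D}$, which is the sense in which adversarial training defends against delusive attacks.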