Adversarial robustness continues to be a major challenge for deep learning. A core issue is that robustness to one type of attack often fails to transfer to other attacks. While prior work establishes a theoretical trade-off in robustness against different $L_p$ norms, we show that there is potential for improvement against many commonly used attacks by adopting a domain generalisation approach. Concretely, we treat each type of attack as a domain, and apply the Risk Extrapolation method (REx), which promotes similar levels of robustness against all training attacks. Compared to existing methods, we obtain similar or superior worst-case adversarial robustness on attacks seen during training. Moreover, we achieve superior performance on families or tunings of attacks only encountered at test time. On ensembles of attacks, our approach improves the accuracy from 3.4% with the best existing baseline to 25.9% on MNIST, and from 16.9% to 23.5% on CIFAR10.
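The REx penalty described above can be sketched as follows. This is a minimal illustration of the variance-penalty form of REx (V-REx), assuming per-attack adversarial losses have already been computed; the function name `vrex_objective` and the weight `beta` are illustrative, not the paper's exact notation.

```python
def vrex_objective(domain_risks, beta=10.0):
    """V-REx-style objective: average risk across domains plus a
    variance penalty that pushes all per-domain (per-attack) risks
    toward the same level.

    domain_risks: list of scalar losses, one per attack type (domain).
    beta: weight on the variance penalty; larger values enforce more
          uniform robustness across attacks (illustrative default).
    """
    n = len(domain_risks)
    mean_risk = sum(domain_risks) / n
    # Population variance of the per-domain risks.
    variance = sum((r - mean_risk) ** 2 for r in domain_risks) / n
    return mean_risk + beta * variance


# Example: equal robustness across attacks incurs no penalty,
# while uneven robustness is penalized even at the same mean risk.
balanced = vrex_objective([1.0, 1.0, 1.0])    # -> 1.0
uneven = vrex_objective([0.5, 1.0, 1.5])      # mean 1.0, penalized above 1.0
```

In training, `domain_risks` would be the adversarial losses under each training attack on a minibatch, so minimizing this objective trades average robustness against evenness of robustness across attacks.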