The phenomenon of adversarial examples illustrates one of the most basic vulnerabilities of deep neural networks. Among the variety of techniques introduced to surmount this inherent weakness, adversarial training has emerged as the most effective strategy for achieving robustness. Typically, this is achieved by balancing robust and natural objectives. In this work, we aim to further optimize the trade-off between robust and standard accuracy by enforcing a domain-invariant feature representation. We present a new adversarial training method, Domain Invariant Adversarial Learning (DIAL), which learns a feature representation that is both robust and domain invariant. DIAL uses a variant of Domain Adversarial Neural Network (DANN) on the natural domain and its corresponding adversarial domain. When the source domain consists of natural examples and the target domain consists of the adversarially perturbed examples, our method learns a feature representation constrained not to discriminate between natural and adversarial examples, and can therefore achieve a more robust representation. Our experiments indicate that our method improves both robustness and standard accuracy when compared to other state-of-the-art adversarial training methods.
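To make the idea concrete, below is a minimal sketch (assuming a PyTorch implementation) of the kind of DANN-style training the abstract describes: a gradient reversal layer pushes the feature extractor to make natural and adversarial features indistinguishable to a domain classifier, while a task head is trained on both domains. The names (`DIALNet`, `fgsm_attack`, `dial_step`, `lambd`) and the one-step attack are illustrative assumptions, not the paper's reference implementation.

```python
# Hypothetical sketch of DANN-style domain-invariant adversarial training.
# Assumptions: flat inputs (e.g., MNIST-like), a one-step FGSM attack to
# generate the "adversarial domain", and a fixed domain-loss weight lambd.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class DIALNet(nn.Module):
    """Feature extractor + task classifier + domain head (natural vs. adversarial)."""
    def __init__(self, in_dim=784, feat_dim=256, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.domain_head = nn.Linear(feat_dim, 2)  # 0 = natural, 1 = adversarial

    def forward(self, x, lambd=1.0):
        z = self.features(x.flatten(1))
        return self.classifier(z), self.domain_head(GradReverse.apply(z, lambd))


def fgsm_attack(model, x, y, eps=0.3):
    """One-step attack used here only to populate the adversarial domain."""
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv)[0], y).backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()


def dial_step(model, optimizer, x, y, lambd=1.0):
    """One training step: task loss on both domains plus the reversed domain loss."""
    x_adv = fgsm_attack(model, x, y)
    optimizer.zero_grad()
    logits_nat, dom_nat = model(x, lambd)
    logits_adv, dom_adv = model(x_adv, lambd)
    task_loss = F.cross_entropy(logits_nat, y) + F.cross_entropy(logits_adv, y)
    dom_labels = torch.cat([torch.zeros(len(x)), torch.ones(len(x))]).long().to(x.device)
    domain_loss = F.cross_entropy(torch.cat([dom_nat, dom_adv]), dom_labels)
    (task_loss + domain_loss).backward()
    optimizer.step()
```

In this sketch, `lambd` controls how strongly the gradient reversal enforces domain invariance, which is one way to realize the robust/standard accuracy trade-off the abstract refers to; the actual attack, architecture, and loss weighting used by DIAL may differ.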