Adversarial training has been widely explored for mitigating attacks against deep models. However, most existing works remain trapped in a dilemma between higher accuracy and stronger robustness, since they tend to fit a model towards robust features (those not easily tampered with by adversaries) while ignoring non-robust but highly predictive features. To achieve a better robustness-accuracy trade-off, we propose Vanilla Feature Distillation Adversarial Training (VFD-Adv), which performs knowledge distillation from a pre-trained model (optimized for high accuracy) to guide adversarial training towards higher accuracy, i.e., to preserve those non-robust but predictive features. More specifically, both adversarial examples and their clean counterparts are forced to be aligned in feature space by distilling predictive representations from the pre-trained/clean model, whereas previous works barely utilize predictive features from clean models. The adversarially trained model is thus updated to maximally preserve accuracy while gaining robustness. A key advantage of our method is that it can be universally adapted to existing adversarial training algorithms and boost their performance. Extensive experiments on various datasets, classification models, and adversarial training algorithms demonstrate the effectiveness of the proposed method.
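Below is a minimal PyTorch sketch of the training objective described above: a standard adversarial classification loss augmented with a feature-distillation term that pulls both clean and adversarial features towards those of a frozen, accuracy-optimized "vanilla" model. The abstract does not specify the exact loss form, so the MSE alignment, the weight `alpha`, the PGD hyperparameters, and the assumption that models return `(features, logits)` are all illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=10):
    """Standard PGD inner attack (illustrative hyperparameters)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        _, logits = model(x_adv)  # model assumed to return (features, logits)
        grad = torch.autograd.grad(F.cross_entropy(logits, y), x_adv)[0]
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = (x + torch.clamp(x_adv - x, -eps, eps)).clamp(0, 1)
    return x_adv.detach()

def vfd_adv_loss(model, vanilla_model, x, y, alpha=1.0):
    """One VFD-Adv-style loss: robust loss + vanilla feature distillation.

    model:         the model being adversarially trained
    vanilla_model: frozen pre-trained model optimized for clean accuracy
    alpha:         distillation weight (assumed hyperparameter)
    """
    x_adv = pgd_attack(model, x, y)
    with torch.no_grad():
        feat_vanilla, _ = vanilla_model(x)  # predictive features of the clean model

    feat_clean, _ = model(x)
    feat_adv, logits_adv = model(x_adv)

    # Standard adversarial (robust) classification loss
    loss_adv = F.cross_entropy(logits_adv, y)

    # Align both clean and adversarial features with the vanilla model's
    # features, preserving non-robust but predictive information
    loss_distill = F.mse_loss(feat_clean, feat_vanilla) + F.mse_loss(feat_adv, feat_vanilla)

    return loss_adv + alpha * loss_distill
```

Because the distillation term is simply added to the base objective, the same sketch would wrap around other adversarial training losses (e.g., TRADES-style objectives) with no structural change, which reflects the claim that the method can be adapted to existing works.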