Adversarial examples (AEs) pose severe threats to the application of deep neural networks (DNNs) in safety-critical domains, e.g., autonomous driving. While there is a vast body of AE defense solutions, to the best of our knowledge, they all suffer from some weaknesses, e.g., defending against only a subset of AEs or causing a relatively high accuracy loss on legitimate inputs. Moreover, most existing solutions cannot defend against adaptive attacks, wherein attackers are knowledgeable about the defense mechanism and craft AEs accordingly. In this paper, we propose a novel AE detection framework based on the very nature of AEs, i.e., their semantic information is inconsistent with the discriminative features extracted by the target DNN model. Specifically, the proposed solution, namely ContraNet, models such contradiction by first feeding both the input and the inference result into a generator to obtain a synthetic output, and then comparing it against the original input. For legitimate inputs that are correctly inferred, the synthetic output tries to reconstruct the input. In contrast, for AEs, instead of reconstructing the input, the synthetic output would be created to conform to the wrong label whenever possible. Consequently, by measuring the distance between the input and the synthetic output with metric learning, we can differentiate AEs from legitimate inputs. We perform comprehensive evaluations under various AE attack scenarios, and experimental results show that ContraNet outperforms existing solutions by a large margin, especially under adaptive attacks. Moreover, our analysis shows that successful AEs that bypass ContraNet tend to have much-weakened adversarial semantics. We also show that ContraNet can be easily combined with adversarial training techniques to achieve further improved AE defense capabilities.
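The detection rule described above can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: `toy_generator`, `distance`, and the threshold are all stand-ins for ContraNet's conditional generator, learned similarity metric, and calibrated decision boundary.

```python
import numpy as np

def toy_generator(x, label):
    # Stand-in for ContraNet's conditional generator: given the classifier's
    # predicted label, it synthesizes an image conforming to that label.
    # Here, label 0 "reconstructs" x and label 1 yields a class prototype.
    prototypes = {0: np.zeros_like(x), 1: np.ones_like(x)}
    return x if label == 0 else prototypes[label]

def distance(x, x_syn):
    # Stand-in for the learned distance from metric learning (toy: MSE).
    return float(np.mean((x - x_syn) ** 2))

def is_adversarial(x, predicted_label, threshold=0.5):
    # Flag the input as an AE when the synthetic output, conditioned on the
    # predicted label, is far from the original input.
    x_syn = toy_generator(x, predicted_label)
    return distance(x, x_syn) > threshold

x = np.zeros((8, 8))          # a "class 0" input
print(is_adversarial(x, 0))   # consistent prediction -> False
print(is_adversarial(x, 1))   # inconsistent, AE-like prediction -> True
```

The key design choice is that the generator is conditioned on the *predicted* label rather than the ground truth, so a wrong (adversarial) prediction pulls the synthesis away from the input and the distance exposes the contradiction.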