Recent work has shown that deep neural networks are vulnerable to adversarial examples. Most studies focus on adversarial example generation, while comparatively little attention has been paid to the more critical problem of adversarial defense. Existing adversarial detection methods usually make assumptions about the adversarial examples or the attack method (e.g., the word frequency of the adversarial examples, or the perturbation level of the attack), which limits their applicability. To this end, we propose TREATED, a universal adversarial detection method that can defend against attacks at various perturbation levels without making any such assumptions. TREATED identifies adversarial examples through a set of well-designed reference models. Extensive experiments on three competitive neural networks and two widely used datasets show that our method achieves better detection performance than the baselines. Finally, we conduct ablation studies to verify the effectiveness of our method.
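To make the reference-model idea concrete, below is a minimal Python sketch of detection via prediction consistency between the victim model and a set of reference models. This is an assumption about the detection rule (the abstract does not spell it out), and all names and the toy models are hypothetical, for illustration only.

    # Hypothetical sketch: flag an input as adversarial when the victim
    # model's prediction disagrees with the reference models' majority vote.
    # (Assumed detection rule; not necessarily TREATED's exact criterion.)
    from collections import Counter
    from typing import Callable, Sequence

    Model = Callable[[str], int]  # maps a text to a predicted class id

    def is_adversarial(text: str, victim: Model, references: Sequence[Model]) -> bool:
        """Return True if the victim's label differs from the references' consensus."""
        ref_labels = [ref(text) for ref in references]
        consensus, _ = Counter(ref_labels).most_common(1)[0]
        return victim(text) != consensus

    # Toy usage with stand-in constant classifiers (illustration only).
    victim: Model = lambda s: 1
    references = [lambda s: 0, lambda s: 0, lambda s: 1]
    print(is_adversarial("an innocuous movie review", victim, references))  # True

Under this reading, clean inputs should yield consistent predictions across models trained or designed differently, while adversarial perturbations crafted against one model tend to break that consistency.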