Traditional (fickle) adversarial examples involve finding a small perturbation that does not change an input's true label but confuses the classifier into outputting a different prediction. Conversely, obstinate adversarial examples occur when an adversary finds a small perturbation that preserves the classifier's prediction but changes the true label of the input. Adversarial training and certified robust training have shown some effectiveness in improving the robustness of machine learning models to fickle adversarial examples. We show that standard adversarial training methods focused on reducing vulnerability to fickle adversarial examples may make a model more vulnerable to obstinate adversarial examples, demonstrating this with experiments on both natural language inference and paraphrase identification tasks. To counter this phenomenon, we introduce Balanced Adversarial Training, which incorporates contrastive learning to increase robustness against both fickle and obstinate adversarial examples.
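The abstract only names the ingredients of Balanced Adversarial Training (standard adversarial training plus a contrastive term), so the following is a minimal sketch of one plausible way to combine them, not the authors' reference implementation. All names here (`balanced_adv_loss`, `model.classify`, `model.encode`, `margin`, `alpha`) are illustrative assumptions: the loss keeps the classifier correct on fickle perturbations while a contrastive term pushes obstinate (label-changing) perturbations away from the clean input in embedding space.

```python
# Hypothetical sketch of a balanced adversarial training objective.
# Assumes a model exposing encode() (embeddings) and classify() (logits);
# these interfaces are illustrative, not from the paper.
import torch
import torch.nn.functional as F


def balanced_adv_loss(model, x, y, x_fickle, x_obstinate, margin=1.0, alpha=0.5):
    """Cross-entropy on fickle perturbations plus a contrastive margin loss.

    x           : clean inputs
    y           : true labels of x
    x_fickle    : perturbed inputs whose true label is still y
    x_obstinate : perturbed inputs whose true label differs from y
    """
    # Standard adversarial-training term: remain correct on fickle examples.
    logits_fickle = model.classify(x_fickle)
    ce = F.cross_entropy(logits_fickle, y)

    # Contrastive term: fickle perturbations should stay close to the clean
    # input in embedding space, while obstinate perturbations are pushed apart,
    # discouraging the model from keeping its prediction when the label changes.
    z = model.encode(x)
    z_pos = model.encode(x_fickle)
    z_neg = model.encode(x_obstinate)
    d_pos = 1.0 - F.cosine_similarity(z, z_pos, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(z, z_neg, dim=-1)
    contrastive = F.relu(d_pos - d_neg + margin).mean()

    return ce + alpha * contrastive
```

In this sketch the balance between the two failure modes is controlled by `alpha`: setting it to zero recovers ordinary adversarial training on fickle examples, while larger values weight robustness to obstinate perturbations more heavily.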