It is known that neural networks are vulnerable to attacks through adversarial perturbations, i.e., inputs which are maliciously crafted to induce wrong predictions. Furthermore, such attacks are impossible to eliminate, i.e., adversarial perturbation remains possible even after applying mitigation methods such as adversarial training. Multiple approaches have been developed to detect and reject such adversarial inputs, mostly in the image domain. Rejecting suspicious inputs, however, may not always be feasible or ideal. First, normal inputs may be rejected due to false alarms raised by the detection algorithm. Second, denial-of-service attacks may be mounted by feeding such systems a stream of adversarial inputs. To address this gap, in this work, we propose an approach to automatically repair adversarial texts at runtime. Given a text suspected to be adversarial, we apply multiple adversarial perturbation methods in a novel, positive way to identify a repair, i.e., a slightly mutated but semantically equivalent text that the neural network classifies correctly. We evaluate our approach on multiple models trained for natural language processing tasks, and the results show that it is effective, i.e., it successfully repairs about 80\% of the adversarial texts. Furthermore, depending on the applied perturbation method, an adversarial text can be repaired in as little as one second on average.