Adversarial attacks pose a major challenge for neural network models in NLP, precluding their deployment in safety-critical applications. A recent line of work, detection-based defense, aims to distinguish adversarial sentences from benign ones. However, the core limitation of previous detection methods is that, unlike defense methods from other paradigms, they cannot restore correct predictions on adversarial sentences. To address this issue, this paper proposes TextShield: (1) we uncover a link between text attacks and saliency information, and propose a saliency-based detector that can effectively detect whether an input sentence is adversarial; (2) we design a saliency-based corrector that converts detected adversarial sentences into benign ones. By combining the saliency-based detector and corrector, TextShield extends the detection-only paradigm to a detection-correction paradigm, thus filling the gap in existing detection-based defenses. Comprehensive experiments show that (a) TextShield consistently achieves performance higher than or comparable to state-of-the-art defense methods across various attacks on different benchmarks, and (b) our saliency-based detector outperforms existing detectors at identifying adversarial sentences.
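As a rough illustration of the detection-correction paradigm described above, the sketch below implements a toy leave-one-out saliency score, a threshold-based detector, and a mask-based corrector. All function names, the thresholding rule, and the leave-one-out scoring are illustrative assumptions for exposition, not the paper's actual method.

```python
# Hedged sketch of a saliency-based detect-then-correct pipeline.
# Everything here (leave-one-out saliency, the 0.5 threshold, masking)
# is an assumption for illustration, not TextShield's actual algorithm.
from typing import Callable, List, Tuple


def token_saliency(sentence: str, score_fn: Callable[[List[str]], float]) -> Tuple[List[str], List[float]]:
    """Leave-one-out saliency: drop in model confidence when a token is removed."""
    tokens = sentence.split()
    base = score_fn(tokens)
    saliencies = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        saliencies.append(base - score_fn(reduced))
    return tokens, saliencies


def detect_adversarial(sentence: str, score_fn: Callable[[List[str]], float],
                       threshold: float = 0.5) -> bool:
    """Flag a sentence as adversarial if any single token is unusually salient."""
    _, sal = token_saliency(sentence, score_fn)
    return max(sal, default=0.0) > threshold


def correct(sentence: str, score_fn: Callable[[List[str]], float],
            threshold: float = 0.5, mask: str = "[MASK]") -> str:
    """Crude 'correction': mask out the tokens whose saliency exceeds the threshold."""
    tokens, sal = token_saliency(sentence, score_fn)
    return " ".join(mask if s > threshold else t for t, s in zip(tokens, sal))
```

A toy classifier whose confidence spikes on one trigger token shows the intended behavior: the trigger gets high leave-one-out saliency, the sentence is flagged, and the corrector masks the trigger while leaving benign sentences untouched. In practice the score function would be a real model's confidence, and the corrector would substitute plausible tokens rather than a mask.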