The Universal Trigger (UniTrigger) is a recently proposed, powerful adversarial textual attack method. Using a learning-based mechanism, UniTrigger generates a fixed phrase that, when added to any benign input, can drop the prediction accuracy of a textual neural network (NN) model to near zero on a target class. To defend against this new attack, which may cause significant harm, we borrow the "honeypot" concept from the cybersecurity community and propose DARCY, a honeypot-based defense framework. DARCY adaptively searches for and injects multiple trapdoors into an NN model to "bait and catch" potential attacks. Through comprehensive experiments across five public datasets, we demonstrate that DARCY detects UniTrigger's adversarial attacks with up to 99% true positive rate (TPR) and less than 1% false positive rate (FPR) in most cases, while changing the F1 score on clean inputs by only around 2% on average. We also show that DARCY with multiple trapdoors is robust under different assumptions about attackers' knowledge and skills.
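For readers unfamiliar with the detection metrics quoted above, the following is a minimal sketch of how TPR and FPR can be computed for a binary attack detector. It is not DARCY's implementation: the function name `detection_rates`, its arguments, and the toy labels are all hypothetical and chosen only for illustration. TPR is the fraction of adversarial inputs that the detector flags; FPR is the fraction of clean inputs that it flags by mistake.

```python
from typing import Sequence, Tuple


def detection_rates(is_attack: Sequence[bool],
                    flagged: Sequence[bool]) -> Tuple[float, float]:
    """Return (TPR, FPR) for a binary attack detector.

    is_attack[i] is True when input i is adversarial (ground truth).
    flagged[i] is True when the detector marked input i as an attack.
    Both names are hypothetical, used only for this illustration.
    """
    if len(is_attack) != len(flagged):
        raise ValueError("is_attack and flagged must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for a, f in zip(is_attack, flagged) if a and f)
    fn = sum(1 for a, f in zip(is_attack, flagged) if a and not f)
    fp = sum(1 for a, f in zip(is_attack, flagged) if not a and f)
    tn = sum(1 for a, f in zip(is_attack, flagged) if not a and not f)

    # Guard against empty classes so the ratios are always defined.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy data (made up): four adversarial inputs followed by four clean ones.
toy_is_attack = [True, True, True, True, False, False, False, False]
toy_flagged = [True, True, True, False, False, False, True, False]
toy_tpr, toy_fpr = detection_rates(toy_is_attack, toy_flagged)
print(f"TPR={toy_tpr:.2f}, FPR={toy_fpr:.2f}")
```

On the toy labels, three of the four adversarial inputs and one of the four clean inputs are flagged, so the detector's TPR is 0.75 and its FPR is 0.25.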