The landscape of adversarial attacks against text classifiers continues to grow, with new attacks developed every year and many of them available in standard toolkits, such as TextAttack and OpenAttack. In response, there is a growing body of work on robust learning, which reduces vulnerability to these attacks, though sometimes at a high cost in compute time or accuracy. In this paper, we take an alternate approach -- we attempt to understand the attacker by analyzing adversarial text to determine which methods were used to create it. Our first contribution is an extensive dataset for attack detection and labeling: 1.5 million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. As our second contribution, we use this dataset to develop and benchmark a number of classifiers for attack identification -- determining whether a given text has been adversarially manipulated and, if so, by which attack. As a third contribution, we demonstrate the effectiveness of three classes of features for these tasks: text properties, capturing the content and presentation of the text; language model properties, determining which tokens are more or less probable throughout the input; and target model properties, representing how the text classifier is influenced by the attack, including internal node activations. Overall, this represents a first step towards forensics for adversarial attacks against text classifiers.
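To make the first two feature classes concrete, the following is a minimal sketch, not the paper's implementation: it computes simple text properties (content and presentation statistics) and language-model properties (per-token log-probabilities) using Hugging Face transformers with GPT-2 as an assumed stand-in language model. The third class, target model properties, is omitted here since it would require access to the attacked classifier's internal activations.

```python
# Minimal sketch of text-property and LM-property features (assumption:
# GPT-2 as the scoring language model; not the paper's actual feature set).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def text_properties(text: str) -> dict:
    """Content/presentation features: length, casing, non-ASCII fraction."""
    n = max(len(text), 1)
    return {
        "num_chars": len(text),
        "num_words": len(text.split()),
        "frac_upper": sum(c.isupper() for c in text) / n,
        "frac_non_ascii": sum(ord(c) > 127 for c in text) / n,
    }

def lm_properties(text: str) -> dict:
    """LM features: statistics over each token's log-probability in context."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each token given its left context.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = logprobs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    return {
        "mean_logprob": token_lp.mean().item(),
        "min_logprob": token_lp.min().item(),
    }

# Usage: feature vectors like these would feed an attack-identification classifier.
print(text_properties("Great movie!"))
print(lm_properties("Great movie!"))
```

Character-level attacks tend to shift the text properties (e.g., more non-ASCII characters), while word-substitution attacks tend to depress token log-probabilities, which is what makes features of this kind discriminative for attack labeling.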