Recent work has demonstrated the vulnerability of modern text classifiers to universal adversarial attacks, which are input-agnostic sequences of words added to any text processed by a classifier. Despite being successful, the word sequences produced by such attacks are often ungrammatical and can be easily distinguished from natural text. We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems when added to benign inputs. We leverage an adversarially regularized autoencoder (ARAE) to generate triggers and propose a gradient-based search that aims to maximize the downstream classifier's prediction loss. Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior attack methods, according to automatic detection metrics and human-subject studies. Our aim is to demonstrate that adversarial attacks can be made harder to detect than previously thought and to enable the development of appropriate defenses.
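To make the approach concrete, the following is a minimal, hypothetical PyTorch sketch of a gradient-based trigger search of the kind described above, not the authors' released implementation. It assumes a pretrained ARAE decoder (`arae_decode`) that maps a latent code to soft trigger-token embeddings and a target classifier (`classifier`) that accepts embedding sequences; all names, shapes, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def search_trigger(arae_decode, classifier, benign_embeds, labels,
                   latent_dim=100, steps=50, lr=1e-2):
    """Gradient-ascent search over an ARAE latent code so that the decoded
    trigger, prepended to benign inputs, maximizes the classifier's loss."""
    z = torch.randn(1, latent_dim, requires_grad=True)   # trigger latent code
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        trigger = arae_decode(z)                          # (1, trig_len, emb_dim) soft embeddings
        # Prepend the same trigger to every benign input in the batch.
        attacked = torch.cat(
            [trigger.expand(benign_embeds.size(0), -1, -1), benign_embeds], dim=1)
        logits = classifier(attacked)                     # (batch, num_classes)
        loss = -F.cross_entropy(logits, labels)           # negate to *maximize* prediction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z.detach()                                     # optimized latent trigger code
```

A full attack would additionally decode the optimized latent code into discrete trigger tokens via the ARAE generator; the sketch stops at the embedding level for brevity.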