Despite great success on many machine learning tasks, deep neural networks are still vulnerable to adversarial samples. While gradient-based adversarial attack methods are well explored in computer vision, it is impractical to apply them directly to natural language processing due to the discrete nature of text. To bridge this gap, we propose a general framework for adapting existing gradient-based methods to craft textual adversarial samples. In this framework, gradient-based continuous perturbations are added to the embedding layer and amplified during forward propagation. The final perturbed latent representations are then decoded with a masked language model head to obtain potential adversarial samples. In this paper, we instantiate our framework with \textbf{T}extual \textbf{P}rojected \textbf{G}radient \textbf{D}escent (\textbf{TPGD}). We conduct comprehensive experiments to evaluate our framework by performing transfer black-box attacks against BERT, RoBERTa, and ALBERT on three benchmark datasets. Experimental results demonstrate that our method achieves overall better performance and produces more fluent and grammatical adversarial samples than strong baseline methods. All the code and data will be made public.
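As a rough illustration of the pipeline the abstract describes, the sketch below applies PGD-style continuous perturbations to a BERT victim model's input embeddings and then decodes the perturbed representations with a masked language model head to recover discrete candidate tokens. This is a minimal sketch under stated assumptions, not the paper's actual implementation: it perturbs only the embedding layer (omitting the amplification of perturbations through forward propagation), and the function name \texttt{tpgd\_attack} and all hyperparameters (\texttt{eps}, \texttt{alpha}, \texttt{n\_steps}) are illustrative placeholders.

\begin{verbatim}
# Hypothetical sketch of the TPGD idea: PGD on input embeddings,
# then MLM-head decoding. Not the authors' reference implementation.
import torch
from transformers import (BertTokenizer, BertForSequenceClassification,
                          BertForMaskedLM)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
victim = BertForSequenceClassification.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
victim.eval(); mlm.eval()

def tpgd_attack(text, label, eps=1.0, alpha=0.25, n_steps=10):
    enc = tokenizer(text, return_tensors="pt")
    # Start from the clean input embeddings of the victim model.
    embeds = victim.bert.embeddings.word_embeddings(enc["input_ids"]).detach()
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(n_steps):
        out = victim(inputs_embeds=embeds + delta,
                     attention_mask=enc["attention_mask"],
                     labels=torch.tensor([label]))
        out.loss.backward()
        with torch.no_grad():
            # Ascent step on the classification loss, then projection
            # of the perturbation back onto an L2 ball of radius eps.
            delta += alpha * delta.grad / (delta.grad.norm() + 1e-12)
            delta *= torch.clamp(eps / (delta.norm() + 1e-12), max=1.0)
        delta.grad.zero_()
    # Decode the perturbed latent representations with the MLM head
    # to obtain discrete adversarial candidate tokens.
    with torch.no_grad():
        hidden = mlm.bert(inputs_embeds=embeds + delta,
                          attention_mask=enc["attention_mask"]).last_hidden_state
        adv_ids = mlm.cls(hidden).argmax(dim=-1)
    return tokenizer.decode(adv_ids[0], skip_special_tokens=True)
\end{verbatim}

In practice the decoded sequence would still need to be filtered (e.g., for label preservation and fluency) before being counted as a successful adversarial sample; the sketch stops at candidate generation.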