Deep neural networks have been shown to be vulnerable to small perturbations of their inputs, known as adversarial attacks. In this paper, we investigate the vulnerability of Neural Machine Translation (NMT) models to adversarial attacks and propose a new attack algorithm called TransFool. To fool NMT models, TransFool builds on a multi-term optimization problem and a gradient projection step. By integrating the embedding representation of a language model, we generate fluent adversarial examples in the source language that maintain a high level of semantic similarity with the clean samples. Experimental results demonstrate that, for different translation tasks and NMT architectures, our white-box attack can severely degrade the translation quality while the semantic similarity between the original and the adversarial sentences stays high. Moreover, we show that TransFool is transferable to unknown target models. Finally, based on automatic and human evaluations, TransFool improves over existing attacks in terms of success rate, semantic similarity, and fluency, in both white-box and black-box settings. Thus, TransFool permits us to better characterize the vulnerability of NMT models and highlights the necessity of designing strong defense mechanisms and more robust NMT systems for real-life applications.
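The abstract describes the attack only at a high level: a multi-term loss optimized in the continuous embedding space, followed by a gradient projection step that maps the perturbed embeddings back onto valid tokens. The snippet below is a minimal, self-contained sketch of that general idea, not the TransFool implementation; the toy stand-ins for the NMT model output, the language-model fluency term, the loss weights, and the nearest-neighbor projection rule are all assumptions made for illustration.

```python
# Minimal sketch: multi-term loss in embedding space + gradient projection
# onto valid token embeddings. Toy tensors replace the NMT model and LM.
import torch

torch.manual_seed(0)

vocab_size, dim, sent_len = 100, 16, 6
embedding_table = torch.randn(vocab_size, dim)      # stand-in for token embeddings
clean_ids = torch.randint(0, vocab_size, (sent_len,))
clean_emb = embedding_table[clean_ids]               # embeddings of the clean source sentence

# Continuous adversarial variables, initialized near the clean embeddings.
adv_emb = (clean_emb + 0.01 * torch.randn_like(clean_emb)).requires_grad_(True)

def multi_term_loss(adv, clean):
    # Term 1 (hypothetical stand-in for the attack objective): push a fake
    # "translation" (here just a mean-pooled vector) away from the clean one.
    attack_term = -(adv.mean(dim=0) - clean.mean(dim=0)).pow(2).sum()
    # Term 2: keep adversarial embeddings close to the clean ones (semantic similarity).
    similarity_term = (adv - clean).pow(2).sum()
    # Term 3: a smoothness penalty standing in for the language-model fluency term.
    fluency_term = (adv[1:] - adv[:-1]).pow(2).sum()
    return attack_term + 1.0 * similarity_term + 0.1 * fluency_term  # weights are assumptions

optimizer = torch.optim.SGD([adv_emb], lr=0.1)
for step in range(20):
    optimizer.zero_grad()
    loss = multi_term_loss(adv_emb, clean_emb)
    loss.backward()
    optimizer.step()
    # Gradient projection step: snap each updated embedding to the nearest
    # entry of the embedding table so the result corresponds to real tokens.
    with torch.no_grad():
        dists = torch.cdist(adv_emb, embedding_table)   # (sent_len, vocab_size)
        nearest_ids = dists.argmin(dim=1)
        adv_emb.copy_(embedding_table[nearest_ids])

print("clean token ids:      ", clean_ids.tolist())
print("adversarial token ids:", nearest_ids.tolist())
```

In a real attack, the first term would be the NMT model's translation loss, the similarity term would operate on language-model embeddings, and the fluency term would come from a language model's perplexity; the sketch only shows how the multiple terms and the projection interact during optimization.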