As modern neural machine translation (NMT) systems have been widely deployed, their security vulnerabilities require close scrutiny. Most recently, NMT systems have been found vulnerable to targeted attacks which cause them to produce specific, unsolicited, and even harmful translations. These attacks are usually exploited in a white-box setting, where adversarial inputs causing targeted translations are discovered for a known target system. However, this approach is less viable when the target system is black-box and unknown to the adversary (e.g., secured commercial systems). In this paper, we show that targeted attacks on black-box NMT systems are feasible, based on poisoning a small fraction of their parallel training data. We show that this attack can be realised practically via targeted corruption of web documents crawled to form the system's training data. We then analyse the effectiveness of the targeted poisoning in two common NMT training scenarios: the from-scratch training and the pre-train & fine-tune paradigm. Our results are alarming: even on the state-of-the-art systems trained with massive parallel data (tens of millions), the attacks are still successful (over 50% success rate) under surprisingly low poisoning budgets (e.g., 0.006%). Lastly, we discuss potential defences to counter such attacks.
翻译:由于现代神经机器翻译系统(NMT)被广泛使用,它们的安全弱点需要仔细审查。最近,NMT系统被发现易受有针对性的攻击,导致它们产生具体、未经请求甚至有害的翻译。这些攻击通常在白箱环境中被利用,因为已知的目标系统发现有针对性翻译的对抗性投入。然而,当目标系统是黑箱,对手不知道(例如,安全的商业系统)时,这一方法就不那么可行。在本文中,我们表明对黑箱NMT系统进行有针对性的攻击是可行的,其依据是它们平行培训数据的一小部分中毒。我们表明,这种攻击实际上可以通过有针对性地腐败网络文件来实现,这些网络文件爬入系统的培训数据。我们随后在两种常见的NMT培训情景中分析定向中毒的实效:从垃圾培训和前导和微调范式(例如,安全的商业系统)。我们的结果令人震惊:即使经过大规模平行数据培训的状态系统(10万),攻击仍然成功(超过50 % 成功率) 。最后,在低中毒预算下(令人惊讶地说,我们反对) 6 。