Neural machine translation systems are known to be vulnerable to adversarial test inputs, however, as we show in this paper, these systems are also vulnerable to training attacks. Specifically, we propose a poisoning attack in which a malicious adversary inserts a small poisoned sample of monolingual text into the training set of a system trained using back-translation. This sample is designed to induce a specific, targeted translation behaviour, such as peddling misinformation. We present two methods for crafting poisoned examples, and show that only a tiny handful of instances, amounting to only 0.02% of the training set, is sufficient to enact a successful attack. We outline a defence method against said attacks, which partly ameliorates the problem. However, we stress that this is a blind-spot in modern NMT, demanding immediate attention.
翻译:众所周知,神经机器翻译系统很容易受到对抗性测试输入,然而,正如我们在本文中所表明的那样,这些系统也容易受到训练攻击。具体地说,我们提议进行中毒攻击,恶意敌人在训练中使用反译法的系统训练组中插入少量单语文字的有毒样本。这个样本旨在诱导一种特定的、有针对性的翻译行为,例如兜售错误信息。我们提出了两种方法来编篡有毒的例子,并表明只有为数不多的几例(仅占训练组的0.02%)足以实施成功的攻击。我们概述了一种防御方法来对付上述攻击,这在一定程度上缓解了问题。然而,我们强调这是现代NMT的盲点,需要立即关注。