We propose a novel gradient-based attack against transformer-based language models that searches for an adversarial example in a continuous space of token probabilities. Our algorithm bridges the gap between the adversarial loss under continuous and discrete text representations by performing multi-step quantization inside a quantization-compensation loop. Experiments show that our method significantly outperforms existing approaches across a range of natural language processing (NLP) tasks.
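The mechanics of the quantization-compensation loop are easiest to see in code. The sketch below is a minimal, hypothetical illustration rather than the paper's implementation: the embedding table, adversarial loss, schedule, and all hyperparameters are toy assumptions, and the compensation step is rendered in an error-feedback style (carrying the soft-minus-hard residual into subsequent gradient steps) as one plausible reading of the loop.

```python
# Minimal sketch of a quantization-compensation loop (illustrative only).
# Assumptions: a stand-in embedding table and a toy adversarial loss; the
# real attack would backpropagate through an actual language model.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq_len, dim = 50, 8, 16
emb = torch.randn(vocab, dim)        # stand-in embedding table (assumption)
target = torch.randn(seq_len, dim)   # stand-in target representation (assumption)

def adv_loss(p):
    # Toy adversarial loss on the relaxed (soft) input embeddings.
    return ((p @ emb - target) ** 2).mean()

# Continuous search variable: logits over the vocabulary at each position.
logits = torch.zeros(seq_len, vocab, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)
residual = torch.zeros(seq_len, vocab)  # carried quantization error

for step in range(1, 201):
    # Compensation: let the gradient see the error left by the last
    # quantization step, so later updates can correct for it.
    probs = F.softmax(logits, dim=-1) + residual
    loss = adv_loss(probs)
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 50 == 0:  # periodic (multi-step) quantization
        with torch.no_grad():
            p = F.softmax(logits, dim=-1)
            hard = F.one_hot(p.argmax(-1), vocab).float()  # nearest one-hot tokens
            residual = p - hard  # remember the soft-hard gap to compensate

tokens = F.softmax(logits, dim=-1).argmax(-1)
print("adversarial token ids:", tokens.tolist())
```

In this reading, the residual term is what closes the continuous-discrete loss gap: each quantization step projects the relaxed distribution onto a hard token sequence, and the stored error redirects subsequent gradient steps toward points where that projection is cheap.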