We propose the first general-purpose gradient-based attack against transformer models. Instead of searching for a single adversarial example, we search for a distribution of adversarial examples parameterized by a continuous-valued matrix, which enables gradient-based optimization. We empirically demonstrate that our white-box attack attains state-of-the-art attack performance on a variety of natural language tasks. Furthermore, we show that a powerful black-box transfer attack, enabled by sampling from the adversarial distribution, matches or exceeds existing methods while requiring only hard-label outputs.
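To make the idea of a distribution parameterized by a continuous-valued matrix concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the matrix (here called `theta`, a hypothetical name) holds one row of logits per token position and one column per vocabulary entry, and it assumes a Gumbel-softmax relaxation as the sampling mechanism; the abstract itself does not name the mechanism. The soft rows are what a gradient-based optimizer would differentiate through in an autodiff framework, while the hard tokens illustrate the discrete samples that a transfer attack could send to a target model.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed (soft) one-hot vector from a single row of logits."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def sample_adversarial_tokens(theta, vocab, temperature=1.0, seed=0):
    """Sample one candidate token sequence from the distribution given by theta.

    theta: list of rows, one per token position, each with len(vocab) logits
           (assumed layout, chosen for illustration).
    Returns (soft_rows, hard_tokens): the relaxed samples and their argmax tokens.
    """
    rng = random.Random(seed)
    soft_rows = [gumbel_softmax_sample(row, temperature, rng) for row in theta]
    hard_tokens = [
        vocab[max(range(len(row)), key=row.__getitem__)] for row in soft_rows
    ]
    return soft_rows, hard_tokens
```

Lower temperatures push each soft row closer to a one-hot vector, at the cost of higher-variance gradients; the values used here are placeholders, not settings reported by the authors.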