This work examines the vulnerability of multimodal (image + text) models to adversarial threats similar to those discussed in previous literature on unimodal (image- or text-only) models. We introduce realistic assumptions of partial model knowledge and access, and discuss how these assumptions differ from the standard "black-box"/"white-box" dichotomy common in current literature on adversarial attacks. Working under various levels of these "gray-box" assumptions, we develop new attack methodologies unique to multimodal classification and evaluate them on the Hateful Memes Challenge classification task. We find that attacking multiple modalities yields stronger attacks than unimodal attacks alone (inducing errors in up to 73% of cases), and that the unimodal image attacks on multimodal classifiers we explored were stronger than character-based text augmentation attacks (inducing errors on average in 45% and 30% of cases, respectively).