While vision-language pre-training (VLP) models have shown revolutionary improvements on various vision-language (V+L) tasks, their adversarial robustness remains largely unexplored. This paper studies adversarial attacks on popular VLP models and V+L tasks. First, we analyze the performance of adversarial attacks under different settings. By examining the influence of different perturbed objects and attack targets, we draw some key observations that serve as guidance for both designing strong multimodal adversarial attacks and constructing robust VLP models. Second, we propose a novel multimodal attack method for VLP models called Collaborative Multimodal Adversarial Attack (Co-Attack), which collectively carries out attacks on the image modality and the text modality. Experimental results demonstrate that the proposed method achieves improved attack performance on different V+L downstream tasks and VLP models. We hope that these observations and the new attack method provide new insight into the adversarial robustness of VLP models, and thereby contribute to their safe and reliable deployment in more real-world scenarios.