In recent years, a significant amount of research effort has concentrated on adversarial attacks on images, while adversarial attacks on videos have seldom been explored. We propose an adversarial attack strategy for videos, called DeepSAVA. Our model combines additive perturbation and spatial transformation within a unified optimisation framework, where the structural similarity index measure (SSIM) is adopted to quantify the adversarial distance. We design an effective and novel optimisation scheme that alternates between Bayesian optimisation, which identifies the most influential frame in a video, and stochastic gradient descent (SGD) based optimisation, which produces both additive and spatially transformed perturbations. This enables DeepSAVA to perform a very sparse attack on videos, maintaining human imperceptibility while still achieving state-of-the-art performance in terms of both attack success rate and adversarial transferability. Our extensive experiments on various types of deep neural networks and video datasets confirm the superiority of DeepSAVA.
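To make the SSIM-based adversarial distance concrete, the following is a minimal sketch rather than the authors' implementation. It computes SSIM over a whole frame as a single window, whereas practical SSIM implementations usually average over local sliding windows. The names `global_ssim` and `ssim_distance`, the choice of `1 - SSIM` as the distance, and the random test frame are all assumptions made for illustration; the constants `k1 = 0.01` and `k2 = 0.03` are the conventional SSIM defaults.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two arrays of equal shape.

    Simplification (assumption): statistics are taken over the whole
    frame instead of averaging over local sliding windows.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * data_range) ** 2  # stabilises the luminance term
    c2 = (k2 * data_range) ** 2  # stabilises the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_distance(x, y, data_range=1.0):
    """Hypothetical adversarial distance: 0 for identical frames."""
    return 1.0 - global_ssim(x, y, data_range)


# Illustrative data only: one random "frame" and a lightly perturbed copy.
rng = np.random.default_rng(0)
frame = rng.random((32, 32))
perturbed = np.clip(frame + 0.05 * rng.standard_normal(frame.shape), 0.0, 1.0)

print(ssim_distance(frame, frame))      # identical frames give distance 0
print(ssim_distance(frame, perturbed))  # small positive distance
```

Unlike an Lp norm, a distance of this kind depends on the means, variances and covariance of the two frames rather than on per-pixel differences alone, which is one reason it can serve as a common measure for both additive and spatially transformed perturbations.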