Despite their excellent performance, state-of-the-art computer vision models often fail when they encounter adversarial examples. Video perception models tend to be even more fragile under attack, because the adversary has more room to manipulate in high-dimensional video data. In this paper, we find that one reason for video models' vulnerability is that they fail to perceive the correct motion under adversarial perturbations. Inspired by the extensive evidence that motion is a key factor for the human visual system, we propose to correct what the model sees by restoring the perceived motion information. Since motion is an intrinsic structure of video data, motion signals can be recovered at inference time without any human annotation, which allows the model to adapt to unforeseen, worst-case inputs. Visualizations and empirical experiments on the UCF-101 and HMDB-51 datasets show that restoring motion information in deep vision models improves adversarial robustness. Our algorithm remains effective even under adaptive attacks, where the adversary knows our defense. Our work provides new insight into robust video perception algorithms by exploiting intrinsic structure in the data. Our webpage is available at https://motion4robust.cs.columbia.edu.
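To make the inference-time motion restoration concrete, the following is a minimal sketch, not the paper's released implementation: before classifying a (possibly adversarial) clip, we take a few gradient steps on the input itself so that its frames become explainable by coherent motion. The photometric warping loss, the `flow_net` interface, and all hyperparameters here are illustrative assumptions.

```python
# Minimal sketch of inference-time motion restoration (assumptions throughout:
# the self-supervised objective, flow_net's interface, and all hyperparameters
# are hypothetical placeholders, not the authors' released code).
import torch
import torch.nn.functional as F

def motion_consistency_loss(video: torch.Tensor, flow_net) -> torch.Tensor:
    """Photometric error when warping each frame by its estimated flow.

    video: (T, C, H, W) clip with values in [0, 1].
    flow_net: assumed to map a frame pair to a (1, 2, H, W) flow field.
    """
    loss = 0.0
    for t in range(video.shape[0] - 1):
        src, tgt = video[t:t+1], video[t+1:t+2]
        flow = flow_net(src, tgt)                       # (1, 2, H, W), in pixels
        _, _, h, w = src.shape
        # Base sampling grid in grid_sample's normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).to(src.device)
        # Shift the grid by the predicted flow, normalized to [-1, 1] units.
        grid = grid + flow.permute(0, 2, 3, 1) / torch.tensor(
            [w / 2.0, h / 2.0], device=src.device)
        warped = F.grid_sample(src, grid, align_corners=True)
        loss = loss + F.l1_loss(warped, tgt)            # photometric consistency
    return loss

def restore_motion(video, flow_net, steps: int = 5, lr: float = 1e-2):
    """Adapt the input at test time: nudge the clip toward one whose frames
    are consistent with the perceived motion, with no human annotation."""
    x = video.clone().requires_grad_(True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        motion_consistency_loss(x, flow_net).backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                          # remain a valid video
    return x.detach()

# Usage: logits = classifier(restore_motion(adv_clip, flow_net))
```

Because the objective is self-supervised, this adaptation step needs no labels and can run on any incoming clip, which is what lets the defense respond to unforeseen, worst-case inputs.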