Adversarial robustness assessment for video recognition models has drawn attention because these models are widely used in safety-critical tasks. Compared with images, videos have much higher dimensionality, which makes generating adversarial videos computationally expensive. The problem is especially serious for query-based black-box attacks, which usually rely on estimating the gradient of the threat model: high dimensionality leads to a large number of queries. To mitigate this issue, we propose to simultaneously eliminate the temporal and spatial redundancy within the video, so that gradient estimation is performed effectively and efficiently on a reduced search space and the number of queries decreases. To implement this idea, we design the novel Adversarial spatial-temporal Focus (AstFocus) attack on videos, which attacks only the key frames and key regions that are simultaneously selected across frames (inter-frame) and within frames (intra-frame). The AstFocus attack is based on a cooperative Multi-Agent Reinforcement Learning (MARL) framework. One agent is responsible for selecting key frames, and another agent is responsible for selecting key regions. The two agents are jointly trained with common rewards received from the black-box threat model so that they make cooperative predictions. With continued querying, the reduced search space composed of key frames and key regions becomes more precise, and the total number of queries becomes smaller than when attacking the original video. Extensive experiments on four mainstream video recognition models and three widely used action recognition datasets demonstrate that the proposed AstFocus attack outperforms the SOTA methods, leading in fooling rate, query number, time, and perturbation magnitude at the same time.
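To make the idea of gradient estimation on a reduced search space concrete, the following is a minimal sketch and not the AstFocus implementation. It assumes the key frames and the key region are already given (in the paper they are selected by the two agents), and it swaps in a generic antithetic random-direction estimator. The names `focus_indices` and `estimate_gradient`, the flat `(T, H, W)` video layout, and the `loss_fn` callback standing in for the black-box threat model are all hypothetical.

```python
import random


def focus_indices(shape, key_frames, box):
    """Flat indices of a (T, H, W) video inside the key region of the key frames.

    Assumption: one rectangular region (y0, y1, x0, x1) shared by all key frames.
    """
    _, height, width = shape
    y0, y1, x0, x1 = box
    return [
        (t * height + y) * width + x
        for t in key_frames
        for y in range(y0, y1)
        for x in range(x0, x1)
    ]


def estimate_gradient(loss_fn, video, indices, n_samples=20, sigma=0.1, rng=None):
    """Estimate the gradient of loss_fn at `video`, perturbing only `indices`.

    Uses antithetic Gaussian directions, so it issues 2 * n_samples queries.
    Entries outside `indices` are never perturbed and keep a zero gradient.
    """
    rng = rng or random.Random()
    grad = [0.0] * len(video)
    for _ in range(n_samples):
        # Random direction that lives only on the reduced search space.
        direction = {i: rng.gauss(0.0, 1.0) for i in indices}
        plus, minus = list(video), list(video)
        for i, u in direction.items():
            plus[i] += sigma * u
            minus[i] -= sigma * u
        # Two black-box queries per sampled direction.
        diff = loss_fn(plus) - loss_fn(minus)
        for i, u in direction.items():
            grad[i] += diff * u / (2.0 * sigma * n_samples)
    return grad
```

In this sketch the query count depends only on `n_samples`. The benefit of the restriction is that each random direction spans far fewer coordinates, which would typically let an estimator of this kind reach a useful gradient with fewer samples than on the full video.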