Efficient Decision-based Black-box Patch Attacks on Video Recognition

Although Deep Neural Networks (DNNs) have demonstrated excellent performance, they are vulnerable to adversarial patches that introduce perceptible and localized perturbations to the input. Generating adversarial patches on images has received much attention, while adversarial patches on videos have not been well investigated. Further, decision-based attacks, where attackers only access the predicted hard labels by querying threat models, have not been well explored on video models either, even though they are practical in real-world video recognition scenes. The absence of such studies leaves a huge gap in the robustness assessment of video models. To bridge this gap, this work is the first to explore decision-based patch attacks on video models. Our analysis shows that the huge parameter space brought by videos and the minimal information returned by decision-based models both greatly increase the attack difficulty and query burden. To achieve a query-efficient attack, we propose a spatial-temporal differential evolution (STDE) framework. First, STDE introduces target videos as patch textures and only adds patches on keyframes that are adaptively selected by temporal difference. Second, STDE takes minimizing the patch area as the optimization objective and adopts spatial-temporal mutation and crossover to search for the global optimum without falling into a local optimum. Experiments show that STDE achieves state-of-the-art performance in terms of threat, efficiency and imperceptibility. Hence, STDE has the potential to be a powerful tool for evaluating the robustness of video recognition models.

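
The two steps named in the abstract (keyframe selection by temporal difference, then an evolutionary search that shrinks the patch area using only hard-label queries) can be illustrated with a small sketch. This is not the authors' implementation. It rests on several assumptions: keyframes are simply the k frames with the largest mean absolute difference from their predecessor, the patch is a single rectangle shared by all keyframes, the search is plain differential evolution over the box coordinates only (the paper's spatial-temporal mutation and crossover operators are not reproduced here), and is_adversarial is a hypothetical hard-label oracle supplied by the caller. All function and parameter names are made up for illustration.

```python
import numpy as np

def select_keyframes(video, k):
    """Indices of the k frames with the largest temporal difference (assumed rule)."""
    # video: (T, H, W, C). Frame 0 has no predecessor, so its score is 0.
    diff = np.abs(np.diff(video.astype(np.float64), axis=0)).mean(axis=(1, 2, 3))
    scores = np.concatenate([[0.0], diff])
    return np.sort(np.argsort(scores)[::-1][:k])

def apply_patch(video, target, keyframes, box):
    """Paste the target video's pixels into box = (y, x, h, w) on the keyframes."""
    y, x, h, w = box
    out = video.copy()
    out[keyframes, y:y + h, x:x + w] = target[keyframes, y:y + h, x:x + w]
    return out

def clip_box(box, H, W):
    """Round a box to integers and keep it inside an H x W frame."""
    y, x, h, w = np.rint(box).astype(int)
    h, w = int(np.clip(h, 1, H)), int(np.clip(w, 1, W))
    y, x = int(np.clip(y, 0, H - h)), int(np.clip(x, 0, W - w))
    return np.array([y, x, h, w])

def stde_sketch(video, target, is_adversarial, k=4, pop_size=8, iters=20,
                F=0.5, CR=0.7, seed=0):
    """Shrink the patch area with plain differential evolution (illustrative only).

    is_adversarial(candidate_video) -> bool is a hypothetical hard-label oracle.
    """
    rng = np.random.default_rng(seed)
    _, H, W, _ = video.shape
    keyframes = select_keyframes(video, k)
    full = np.array([0, 0, H, W])

    def fools(box):
        return bool(is_adversarial(apply_patch(video, target, keyframes, box)))

    if not fools(full):
        return None, keyframes  # even a full-frame patch on the keyframes fails

    # Initial population: random large boxes, falling back to the full-frame
    # box whenever a random box is not adversarial.
    pop = []
    for _ in range(pop_size):
        h, w = rng.integers(H // 2, H + 1), rng.integers(W // 2, W + 1)
        box = clip_box([rng.integers(0, H), rng.integers(0, W), h, w], H, W)
        pop.append(box if fools(box) else full.copy())
    pop = np.array(pop)

    for _ in range(iters):
        for i in range(pop_size):
            a, b, c = pop[rng.choice(pop_size, size=3, replace=False)]
            mutant = a + F * (b - c)                  # DE mutation
            cross = rng.random(4) < CR                # binomial crossover mask
            trial = clip_box(np.where(cross, mutant, pop[i]), H, W)
            # Keep the trial only if it is smaller AND still fools the model;
            # the oracle is queried only when the area actually improves.
            if trial[2] * trial[3] < pop[i][2] * pop[i][3] and fools(trial):
                pop[i] = trial

    best = pop[np.argmin(pop[:, 2] * pop[:, 3])]
    return best, keyframes
```

Because a candidate replaces a population member only when it is both smaller and still adversarial, every box in the population stays adversarial throughout, and the oracle is queried only for candidates that would reduce the area. That ordering of the two checks is one simple way to keep the query count down in a hard-label setting.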