We have seen great progress in video action recognition in recent years. Several models based on convolutional neural networks (CNNs), along with more recent transformer-based approaches, provide state-of-the-art performance on existing benchmark datasets. However, large-scale robustness, a critical aspect for real-world applications, has not been studied for these models. In this work, we perform a large-scale robustness analysis of existing models for video action recognition. We mainly focus on robustness against distribution shifts caused by real-world perturbations rather than adversarial perturbations. We propose four benchmark datasets, HMDB-51P, UCF-101P, Kinetics-400P, and SSv2P, and study the robustness of six state-of-the-art action recognition models against 90 different perturbations. The study reveals some interesting findings: 1) transformer-based models are consistently more robust than CNN-based models against most perturbations; 2) pretraining helps transformer-based models become more robust to different perturbations than CNN-based models; and 3) all of the studied models are robust to temporal perturbations on the Kinetics dataset but not on SSv2, which suggests that temporal information is much more important for action label prediction on SSv2 than on Kinetics. We hope that this study will serve as a benchmark for future research in robust video action recognition. More details about the project are available at https://rose-ar.github.io/.
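To make the notion of a real-world perturbation concrete, the following is a minimal illustrative sketch, not the authors' implementation, of how an appearance perturbation such as Gaussian noise might be applied to a video clip at a chosen severity level; the function name and the severity-to-noise mapping are assumptions for illustration only.

```python
import numpy as np


def gaussian_noise_perturbation(video: np.ndarray, severity: int = 1) -> np.ndarray:
    """Add zero-mean Gaussian noise to every frame of a video clip.

    video: uint8 array of shape (num_frames, height, width, 3)
    severity: integer in [1, 5]; larger values give stronger noise
    """
    # Hypothetical severity-to-sigma mapping, chosen only for illustration.
    sigma = [0.04, 0.08, 0.12, 0.18, 0.26][severity - 1]
    frames = video.astype(np.float32) / 255.0
    noisy = frames + np.random.normal(scale=sigma, size=frames.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)


# Example: perturb a random 16-frame clip at severity 3.
clip = np.random.randint(0, 256, size=(16, 112, 112, 3), dtype=np.uint8)
noisy_clip = gaussian_noise_perturbation(clip, severity=3)
```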