Video action recognition has seen great progress in recent years. Several models based on convolutional neural networks (CNNs), along with more recent transformer-based approaches, achieve top performance on existing benchmarks. In this work, we perform a large-scale robustness analysis of these existing models for video action recognition, focusing on robustness against real-world distribution-shift perturbations rather than adversarial perturbations. We propose four benchmark datasets, HMDB51-P, UCF101-P, Kinetics400-P, and SSv2-P, for this analysis, and study the robustness of six state-of-the-art action recognition models against 90 different perturbations. The study reveals several interesting findings: 1) transformer-based models are consistently more robust than CNN-based models; 2) pretraining improves robustness more for transformer-based models than for CNN-based models; and 3) all of the studied models are robust to temporal perturbations on every dataset except SSv2, suggesting that the importance of temporal information for action recognition varies with the dataset and the activities it contains. Next, we study the role of augmentations in model robustness and present a real-world dataset, UCF101-DS, which contains realistic distribution shifts, to further validate some of these findings. We believe this study will serve as a benchmark for future research in robust video action recognition.