Deep neural networks for video action recognition easily learn to exploit shortcut static features, such as the background and objects, instead of motion features. This results in poor generalization to atypical videos, such as soccer played on a concrete surface instead of a soccer field. However, because out-of-distribution (OOD) data are rare, quantitatively evaluating static bias remains difficult. In this paper, we synthesize new benchmark sets for evaluating the static bias of action representations: SCUB for static cues in the background, and SCUF for static cues in the foreground. Furthermore, we propose a simple yet effective video data augmentation technique, StillMix, that automatically identifies bias-inducing video frames; unlike similar augmentation techniques, StillMix does not need to enumerate or precisely segment the biased content. With extensive experiments, we quantitatively compare and analyze existing action recognition models on the created benchmarks to reveal their characteristics. We validate the effectiveness of StillMix and show that it improves TSM (Lin, Gan, and Han 2021) and Video Swin Transformer (Liu et al. 2021) by more than 10% in accuracy on SCUB for OOD action recognition.
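Below is a minimal PyTorch sketch of the frame-blending idea the abstract describes, under stated assumptions: the function name `stillmix`, the batched tensor layout, and the mixing-weight range are illustrative choices, not the paper's exact formulation, and the step that automatically scores and selects bias-inducing frames is assumed to happen upstream, represented here only by the `still_frames` argument.

```python
import torch

def stillmix(videos, still_frames, lam_range=(0.0, 0.5)):
    """Blend a static (potentially bias-inducing) frame into each
    training clip while keeping the original action label.

    videos:       (B, C, T, H, W) batch of training clips
    still_frames: (B, C, H, W) frames sampled from other videos,
                  e.g. ranked as bias-inducing by a scoring model
                  (that selection step is assumed to be done upstream)
    """
    b = videos.size(0)
    # Sample a per-clip mixing weight; the label is NOT mixed, so the
    # still frame acts only as a label-irrelevant static distractor.
    lam = torch.empty(b, device=videos.device).uniform_(*lam_range)
    lam = lam.view(b, 1, 1, 1, 1)
    # Broadcast each still frame across the temporal dimension and blend.
    mixed = (1.0 - lam) * videos + lam * still_frames.unsqueeze(2)
    return mixed
```

Because only the pixels are mixed while the action label is kept, the blended-in still frame supplies static cues that carry no label information, pushing the model to rely on motion rather than background or object appearance.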