In video action recognition, shortcut static features can interfere with the learning of motion features, resulting in poor out-of-distribution (OOD) generalization. The video background is an obvious source of static bias, but the video foreground, such as the actor's clothing, can also provide static bias. In this paper, we empirically verify the existence of foreground static bias by creating test videos with conflicting signals from the static and moving portions of the video. To tackle this issue, we propose a simple yet effective technique, StillMix, for learning robust action representations. Specifically, StillMix identifies bias-inducing video frames using a 2D reference network and mixes them into the training videos, which effectively suppresses bias even when we cannot explicitly extract the source of bias within each frame or enumerate the types of bias. Finally, to precisely evaluate static bias, we synthesize two new benchmarks: SCUBA for static cues in the background, and SCUFO for static cues in the foreground. With extensive experiments, we demonstrate that StillMix mitigates both types of static bias and improves video representations for downstream applications.
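To make the mixing step concrete, below is a minimal PyTorch sketch of the frame-into-video blending that the abstract describes. It assumes the bias-inducing still frames have already been selected by the 2D reference network (selection is not shown), that one mixing coefficient is drawn uniformly per clip, and that the original clip's action label is kept unchanged so the model must rely on motion cues; the function name `stillmix_batch` and these sampling details are illustrative assumptions, not the paper's exact specification.

```python
import torch

def stillmix_batch(videos: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
    """Blend bias-inducing still frames into training clips (hypothetical sketch).

    videos: (B, C, T, H, W) training clips.
    frames: (B, C, H, W) still frames flagged as bias-inducing by a
            2D reference network (selection step not shown here).

    Each still frame is broadcast across time and mixed into a clip.
    We assume the clip's original action label is retained, so the
    injected static content acts as a distractor the model must ignore.
    """
    b = videos.size(0)
    # One mixing coefficient per clip, drawn uniformly in [0, 1]
    # (an assumption; the paper's sampling scheme may differ).
    lam = torch.rand(b, device=videos.device).view(b, 1, 1, 1, 1)
    # Broadcast (B, C, H, W) frames over the temporal dimension T.
    still = frames.unsqueeze(2)  # (B, C, 1, H, W)
    return lam * videos + (1.0 - lam) * still
```

In this reading, StillMix resembles mixup-style augmentation, except that the mixed-in component is a static frame treated purely as bias: because the label never changes, any static cue the frame carries is decorrelated from the target, pushing the network toward motion features.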