Recent self-supervised video representation learning methods focus on maximizing the similarity between multiple augmented views from the same video and largely rely on the quality of the generated views. However, most existing methods lack a mechanism to prevent representation learning from being biased toward static information in the video. In this paper, we propose frequency augmentation (FreqAug), a spatio-temporal data augmentation method in the frequency domain for video representation learning. FreqAug stochastically removes specific frequency components from the video so that the learned representation captures essential features from the remaining information for various downstream tasks. Specifically, FreqAug pushes the model to focus on dynamic rather than static features in the video by dropping spatial or temporal low-frequency components. To verify the generality of the proposed method, we experiment with FreqAug on multiple self-supervised learning frameworks along with standard augmentations. Transferring the improved representations to five video action recognition and two temporal action localization downstream tasks shows consistent improvements over baselines.
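The abstract only outlines the core operation, so the following is a minimal, hypothetical PyTorch sketch of what dropping low-frequency components in the frequency domain could look like; the function name `freq_aug`, the hard cutoff mask, and the hyperparameters `drop_ratio` and `p` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.fft


def freq_aug(video, drop_ratio=0.25, p=0.5, temporal=True):
    """Sketch of low-frequency dropping for a video tensor.

    video:      tensor of shape (C, T, H, W)
    drop_ratio: fraction of the lowest frequencies to zero out (assumed)
    p:          probability of applying the augmentation (assumed)
    temporal:   drop along the temporal axis if True, else the spatial axes
    """
    if torch.rand(1).item() > p:
        return video

    dims = (1,) if temporal else (2, 3)
    # FFT along the chosen axes; fftshift moves low frequencies to the center
    spec = torch.fft.fftshift(torch.fft.fftn(video, dim=dims), dim=dims)

    # Build a binary mask that zeroes out the central (low-frequency) band
    mask = torch.ones_like(spec.real)
    if temporal:
        T = video.shape[1]
        half = max(1, int(T * drop_ratio / 2))
        c = T // 2
        mask[:, c - half:c + half + 1] = 0.0
    else:
        H, W = video.shape[2], video.shape[3]
        hh = max(1, int(H * drop_ratio / 2))
        hw = max(1, int(W * drop_ratio / 2))
        ch, cw = H // 2, W // 2
        mask[:, :, ch - hh:ch + hh + 1, cw - hw:cw + hw + 1] = 0.0

    # Remove the masked band and transform back to the pixel domain
    spec = spec * mask
    out = torch.fft.ifftn(torch.fft.ifftshift(spec, dim=dims), dim=dims)
    return out.real
```

Under this sketch, the temporal variant suppresses slowly varying (near-static) content across frames, while the spatial variant suppresses coarse appearance structure within each frame, which is consistent with the stated goal of steering the representation toward dynamic features.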