Despite the great progress in video understanding made by deep convolutional neural networks, feature representations learned by existing methods may be biased toward static visual cues. To address this issue, we propose a novel method to suppress static visual cues (SSVC) based on probabilistic analysis for self-supervised video representation learning. In our method, video frames are first encoded via normalizing flows to obtain latent variables that follow the standard normal distribution. By modelling the static factors in a video as a random variable, the conditional distribution of each latent variable becomes a shifted and scaled normal. The latent variables that vary less over time are then selected as static cues and suppressed to generate motion-preserved videos. Finally, positive pairs are constructed from motion-preserved videos for contrastive learning, which alleviates the bias of the learned representation toward static cues. The less-biased video representation generalizes better to various downstream tasks. Extensive experiments on publicly available benchmarks demonstrate that the proposed method outperforms the state of the art when only the RGB modality is used for pre-training.
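To make the static-cue suppression step concrete, below is a minimal sketch of one plausible reading of the pipeline: encode frames into per-frame latents, rank latent dimensions by temporal variance, and resample the least-varying ones from the standard normal prior before decoding. The toy orthogonal "flow", the latent size `D`, the clip length `T`, the number of suppressed dimensions `k`, and the variance-ranking rule are all illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Hypothetical sketch of static-cue suppression. The paper encodes frames
# with a learned normalizing flow; here an orthogonal linear map stands in
# so that encode/decode are exactly invertible and the example runs.
rng = np.random.default_rng(0)

D = 16  # latent dimensionality (assumed)
T = 8   # frames per clip (assumed)

Q, _ = np.linalg.qr(rng.normal(size=(D, D)))  # orthogonal => invertible
encode = lambda x: x @ Q    # frame features -> latent variables
decode = lambda z: z @ Q.T  # latent variables -> frame features

# Fake per-frame features: half the dimensions are constant over time
# (static background), the other half vary (motion).
static_part = np.tile(rng.normal(size=(1, D // 2)), (T, 1))
motion_part = rng.normal(size=(T, D // 2))
frames = np.concatenate([static_part, motion_part], axis=1)

z = encode(frames)  # (T, D) latents, one row per frame

# Select the latent dimensions that vary least along time as static cues.
temporal_var = z.var(axis=0)
k = D // 2  # number of dimensions to suppress (assumed)
static_dims = np.argsort(temporal_var)[:k]

# Suppress static cues by resampling them from the standard normal prior,
# leaving the motion-carrying dimensions untouched.
z_suppressed = z.copy()
z_suppressed[:, static_dims] = rng.standard_normal((T, k))

motion_preserved = decode(z_suppressed)  # "motion-preserved" frames
print("suppressed dims:", np.sort(static_dims))
```

Under this reading, the original clip and its motion-preserved version would then form a positive pair for a standard contrastive objective, so that agreement between them can only be achieved through motion information.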