In light of the success of contrastive learning in the image domain, current self-supervised video representation learning methods typically adopt a contrastive loss to learn video representations. When two augmented views of a video are naively pulled closer, however, the model tends to exploit the shared static background as a shortcut and fails to capture the motion information, a phenomenon dubbed background bias. This bias weakens the model's generalization ability, leading to worse performance on downstream tasks such as action recognition. To alleviate such bias, we propose Foreground-background Merging (FAME), which deliberately composes the foreground region of a selected video onto the backgrounds of other videos. Specifically, without any off-the-shelf detector, we extract the foreground and background regions via frame difference and color statistics, and shuffle the background regions among the videos. By leveraging the semantic consistency between the original clips and the fused ones, the model focuses more on the foreground motion pattern and is thus more robust to the background context. Extensive experiments demonstrate that FAME significantly boosts performance on different downstream tasks with various backbones. When integrated with MoCo, FAME reaches 84.8% and 53.5% accuracy on UCF101 and HMDB51, respectively, achieving state-of-the-art performance.
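To make the merging step concrete, below is a minimal sketch of how foreground-background merging could be implemented for a batch of clips. It is an illustration under stated assumptions, not the paper's exact procedure: the function name fame_merge, the fg_ratio hyperparameter, and the particular way the motion cue (frame difference) and color cue (deviation from the clip's mean color) are combined and thresholded are all assumptions made for this sketch.

```python
import torch

def fame_merge(clips: torch.Tensor, fg_ratio: float = 0.4) -> torch.Tensor:
    """Foreground-background Merging (FAME), minimal sketch.

    clips: (B, C, T, H, W) float tensor of video clips in [0, 1].
    fg_ratio: assumed hyperparameter, fraction of pixels kept as foreground.
    Returns fused clips whose backgrounds are shuffled across the batch.
    """
    B, C, T, H, W = clips.shape

    # 1) Motion cue: mean absolute frame difference over time and channels.
    motion = (clips[:, :, 1:] - clips[:, :, :-1]).abs().mean(dim=(1, 2))  # (B, H, W)

    # 2) Color cue: deviation of each pixel from the clip's global mean color
    #    (the static background tends to dominate the mean).
    mean_color = clips.mean(dim=(2, 3, 4), keepdim=True)        # (B, C, 1, 1, 1)
    color = (clips - mean_color).abs().mean(dim=(1, 2))         # (B, H, W)

    # 3) Combine cues and binarize so that ~fg_ratio of pixels are foreground.
    score = motion * color
    k = int(fg_ratio * H * W)
    thresh = score.flatten(1).kthvalue(H * W - k, dim=1).values  # (B,)
    mask = (score > thresh.view(B, 1, 1)).float()                # (B, H, W)
    mask = mask.view(B, 1, 1, H, W)                              # broadcast over C, T

    # 4) Shuffle backgrounds across the batch and merge: each fused clip keeps
    #    its own foreground but inherits another clip's background.
    perm = torch.randperm(B)
    fused = mask * clips + (1 - mask) * clips[perm]
    return fused
```

In a contrastive pipeline such as MoCo, the fused clip and the original clip would then be treated as a positive pair, so agreement can only be reached through the (shared) foreground motion rather than the (shuffled) background.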