Self-supervised learning has shown great potential for improving the video representation ability of deep neural networks by deriving supervision from the data itself. However, some current methods tend to cheat from the background, i.e., the prediction depends heavily on the video background rather than the motion, making the model vulnerable to background changes. To mitigate the model's reliance on the background, we propose to remove the background impact by adding the background. That is, given a video, we randomly select a static frame and add it to every other frame to construct a distracting video sample. We then force the model to pull the feature of the distracting video and the feature of the original video closer, so that the model is explicitly restricted from relying on the background and focuses more on the motion changes. We term our method \emph{Background Erasing} (BE). It is worth noting that our method is simple to implement and can be added to most state-of-the-art methods with little effort. Specifically, BE brings 16.4% and 19.1% improvements with MoCo on the severely biased datasets UCF101 and HMDB51, and a 14.5% improvement on the less biased dataset Diving48.