We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. In contrast to images, which capture static scene appearance, videos also contain sound and temporal scene dynamics. To leverage the temporal and aural dimensions inherent to videos, our method extends temporal self-supervision to the audio-visual setting and integrates it with multi-modal contrastive objectives. As temporal self-supervision, we pose playback speed and direction recognition in both modalities and propose intra- and inter-modal temporal ordering tasks. Furthermore, we design a novel contrastive objective in which the usual pairs are supplemented with additional sample-dependent positives and negatives sampled from the evolving feature space. In our model, we apply such losses among video clips and between videos and their temporally corresponding audio clips. We verify our model design in extensive ablation experiments and evaluate the video and audio representations in transfer experiments on action recognition and retrieval on UCF101 and HMDB51, audio classification on ESC50, and robust video fingerprinting on VGG-Sound, with state-of-the-art results.
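To make the contrastive objective concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of an InfoNCE-style loss in which the temporally aligned video/audio pair is supplemented with extra sample-dependent positives and negatives mined from the current feature space, here approximated by a memory bank of past clip embeddings. All names (`nn_contrastive_loss`, `memory_bank`, `num_extra_pos`, `num_extra_neg`) are hypothetical and only stand in for the components described in the abstract.

```python
# Hypothetical sketch of a cross-modal contrastive loss with additional
# feature-space positives and negatives; assumes L2-normalised embeddings.
import torch
import torch.nn.functional as F


def nn_contrastive_loss(video_emb, audio_emb, memory_bank,
                        num_extra_pos=2, num_extra_neg=16, temperature=0.07):
    """video_emb, audio_emb: (B, D) embeddings of the same B clips;
    memory_bank: (M, D) embeddings of previously seen clips."""
    B = video_emb.size(0)

    # Similarity of every video clip to every audio clip in the batch;
    # the diagonal holds the temporally corresponding (positive) pairs.
    logits_cross = video_emb @ audio_emb.t() / temperature            # (B, B)

    # Mine extra sample-dependent positives (nearest neighbours in the
    # evolving feature space) and extra negatives (random bank entries).
    sim_bank = video_emb @ memory_bank.t()                            # (B, M)
    pos_idx = sim_bank.topk(num_extra_pos, dim=1).indices             # (B, P)
    neg_idx = torch.randint(memory_bank.size(0), (B, num_extra_neg),
                            device=video_emb.device)                  # (B, N)

    extra_pos = memory_bank[pos_idx]                                  # (B, P, D)
    extra_neg = memory_bank[neg_idx]                                  # (B, N, D)
    logits_pos = torch.einsum('bd,bpd->bp', video_emb, extra_pos) / temperature
    logits_neg = torch.einsum('bd,bnd->bn', video_emb, extra_neg) / temperature

    # Numerator: aligned audio plus mined positives; denominator additionally
    # contains the in-batch audio negatives and the mined negatives.
    logits = torch.cat([logits_cross, logits_pos, logits_neg], dim=1)
    log_prob = F.log_softmax(logits, dim=1)
    pos_mask = torch.zeros_like(logits, dtype=torch.bool)
    pos_mask[torch.arange(B), torch.arange(B)] = True                 # aligned audio
    pos_mask[:, B:B + num_extra_pos] = True                           # mined positives
    return -(log_prob[pos_mask].view(B, -1).mean(dim=1)).mean()
```

The same form of loss can be instantiated purely within the video modality (clip-to-clip) or across modalities (video-to-audio), as the abstract describes.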