Most successful self-supervised learning methods are trained to align the representations of two independent views of the data. State-of-the-art methods in video are inspired by image techniques, where these two views are extracted in the same way, by cropping and then augmenting the resulting crop. However, these methods miss a crucial element of the video domain: time. We introduce BraVe, a self-supervised learning framework for video. In BraVe, one of the views has access to a narrow temporal window of the video while the other view has broad access to the video content. Our models learn to generalise from the narrow view to the general content of the video. Furthermore, BraVe processes the views with different backbones, enabling the use of alternative augmentations or modalities for the broad view, such as optical flow, randomly convolved RGB frames, audio, or their combinations. We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks, including UCF101, HMDB51, Kinetics, ESC-50, and AudioSet.
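To make the narrow-to-broad setup concrete, below is a minimal sketch of a BraVe-style training step in PyTorch. It assumes toy linear backbones, synthetic video tensors, and a BYOL-style normalised regression loss with a stop-gradient on the target; all module names, dimensions, and the exact head structure here are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Projector/predictor head (Linear -> BatchNorm -> ReLU -> Linear)."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.BatchNorm1d(d_hidden),
            nn.ReLU(inplace=True),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)

def regress(pred, target):
    """Normalised L2 regression; gradients do not flow into the target."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1).detach()  # stop-gradient on target
    return (pred - target).pow(2).sum(dim=-1).mean()

B, T, C, H, W = 4, 64, 3, 16, 16  # toy batch of 64-frame videos
T_narrow = 16                     # narrow temporal window length
D = 128                           # backbone embedding size

# Hypothetical backbones: separate encoders for the narrow and broad views
# (in the paper, the broad backbone may instead consume flow or audio).
f_narrow = nn.Sequential(nn.Flatten(), nn.Linear(T_narrow * C * H * W, D))
f_broad = nn.Sequential(nn.Flatten(), nn.Linear(T * C * H * W, D))

# Each view gets a projector g and a predictor h; predictions cross views.
g_n, h_n = MLP(D, 256, 64), MLP(64, 256, 64)
g_b, h_b = MLP(D, 256, 64), MLP(64, 256, 64)

video = torch.randn(B, T, C, H, W)                  # synthetic video batch
t0 = torch.randint(0, T - T_narrow + 1, ()).item()  # random start time
narrow = video[:, t0:t0 + T_narrow]                 # short temporal window
broad = video                                       # full temporal extent

z_n = g_n(f_narrow(narrow))
z_b = g_b(f_broad(broad))

# The narrow view predicts the broad representation, and vice versa.
loss = regress(h_n(z_n), z_b) + regress(h_b(z_b), z_n)
loss.backward()
print(f"BraVe-style loss: {loss.item():.4f}")
```

Because the two views are processed by different backbones, the broad branch is free to use a different input modality or augmentation entirely; in this sketch, swapping `broad` for an optical-flow or audio tensor would only require changing the input size of `f_broad`.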