Learning visual representations through self-supervision is extremely challenging, as the network must separate relevant patterns from spurious distractors without the explicit guidance that supervision provides. This is typically achieved through heavy data augmentation, large-scale datasets, and prohibitive amounts of compute. Video self-supervised learning (SSL) faces additional challenges: video datasets are typically smaller than image datasets, compute requirements are an order of magnitude larger, and the number of spurious patterns the optimizer has to sift through is multiplied several-fold. Thus, learning self-supervised representations directly from video data might result in sub-optimal performance. To address this, we propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework, enabling the model to learn strong spatial and temporal information without relying on labeled video data. To this end, we modify the typical video-based SSL design and objective to encourage the video encoder to \textit{subsume} the semantic content of an image-based model trained on a general domain. The proposed algorithm is shown to learn much more efficiently (i.e., in fewer epochs and with a smaller batch size) and achieves new state-of-the-art performance on standard downstream tasks among single-modality SSL methods.