The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-supervised co-training scheme to improve the popular InfoNCE loss, exploiting the complementary information from different views (RGB streams and optical flow) of the same data source by using one view to obtain positive class samples for the other; (iii) we thoroughly evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval. In both cases, the proposed approach demonstrates performance that is state-of-the-art or comparable to other self-supervised approaches, whilst being significantly more efficient to train, i.e. requiring far less training data to achieve similar performance.
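To make the loss referred to in contributions (i) and (ii) concrete, the following is a minimal NumPy sketch of an InfoNCE-style contrastive loss extended to allow several positive keys per query (with a single positive it reduces to standard InfoNCE). The function name, temperature value, and mask-based interface are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def multi_positive_infonce(query, keys, positive_mask, temperature=0.07):
    """Contrastive loss over one query and a set of keys.

    positive_mask marks which keys count as positives; in the
    co-training setting described above, positives for one view
    (e.g. RGB) can be mined using similarity in the other view
    (e.g. optical flow). Returns -log(sum_pos exp(s/t) / sum_all exp(s/t)).
    """
    # L2-normalise so the dot product is a cosine similarity.
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = (k @ q) / temperature
    # Numerically stable log-sum-exp over positives and over all keys.
    m = logits.max()
    log_denom = m + np.log(np.exp(logits - m).sum())
    log_num = m + np.log(np.exp(logits[positive_mask] - m).sum())
    return log_denom - log_num  # always >= 0

# Illustrative usage: the loss is small when the query aligns with
# its positive key, large when the marked positive is dissimilar.
q = np.array([1.0, 0.0, 0.0])
keys = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
loss_aligned = multi_positive_infonce(q, keys, np.array([True, False, False]))
loss_misaligned = multi_positive_infonce(q, keys, np.array([False, True, False]))
```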