Labeling videos at scale is impractical. Consequently, self-supervised visual representation learning is key for efficient video analysis. Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge. However, when applied to real-world videos, contrastive learning can inadvertently separate instances that contain semantically similar events. In our work, we introduce a cooperative variant of contrastive learning that addresses this issue by utilizing complementary information across views. We use data-driven sampling to leverage implicit relationships between multiple input video views, whether observed (e.g. RGB) or inferred (e.g. flow, segmentation masks, poses). We are among the first to explore exploiting inter-instance relationships to drive learning. We experimentally evaluate our representations on the downstream task of action recognition. Our method achieves competitive performance on standard benchmarks (UCF101, HMDB51, Kinetics400). Furthermore, qualitative experiments illustrate that our models can capture higher-order class relationships.
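The cooperative sampling idea can be illustrated with a minimal sketch, shown below. This is an illustrative PyTorch example, not the paper's exact loss: `cooperative_nce_loss`, `queue_rgb`, `queue_flow`, `tau`, and `topk` are hypothetical names. It shows how similarities computed in one view (e.g. flow) can mine additional positives for a contrastive objective in another view (e.g. RGB), so that semantically related instances are pulled together rather than pushed apart.

```python
# Minimal sketch of cooperative positive mining across views (illustrative;
# assumes paired RGB/flow embeddings from hypothetical encoders and a
# memory bank of past embeddings for each view).
import torch
import torch.nn.functional as F

def cooperative_nce_loss(z_rgb, z_flow, queue_rgb, queue_flow,
                         tau=0.07, topk=5):
    """InfoNCE over RGB embeddings, with extra positives mined from the
    complementary flow view: queue items whose flow embeddings are most
    similar to the anchor's flow embedding count as positives."""
    z_rgb = F.normalize(z_rgb, dim=1)            # (B, D) anchor embeddings
    z_flow = F.normalize(z_flow, dim=1)          # (B, D) complementary view
    queue_rgb = F.normalize(queue_rgb, dim=1)    # (K, D) memory bank, RGB
    queue_flow = F.normalize(queue_flow, dim=1)  # (K, D) memory bank, flow

    logits = z_rgb @ queue_rgb.t() / tau         # (B, K) RGB similarities
    sim_flow = z_flow @ queue_flow.t()           # (B, K) flow similarities

    # Mark the top-k flow neighbours of each anchor as positives.
    pos_mask = torch.zeros_like(logits)
    idx = sim_flow.topk(topk, dim=1).indices
    pos_mask.scatter_(1, idx, 1.0)

    # Multi-positive InfoNCE: -log(sum over positives / sum over all).
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1)
    return loss.mean()
```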