This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning. It stems from the observation that the human visual system can easily identify video incoherence based on a comprehensive understanding of videos. Specifically, each training sample, denoted as an incoherent clip, is constructed from multiple sub-clips hierarchically sampled from the same raw video, with varying lengths of incoherence between them. The network is trained to learn high-level representations by predicting the location and length of the incoherence given the incoherent clip as input. Additionally, intra-video contrastive learning is introduced to maximize the mutual information between incoherent clips from the same raw video. We evaluate our proposed method through extensive experiments on action recognition and video retrieval using various backbone networks. Experiments show that our method achieves state-of-the-art performance across different backbone networks and datasets compared with previous coherence-based methods.
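The incoherent-clip construction described above can be illustrated with a minimal sketch. This is an assumption-laden toy implementation, not the paper's code: the function name, parameters, and label format (one `(location, length)` pair per incoherence) are all hypothetical, and it operates on frame indices only.

```python
import random

def make_incoherent_clip(num_frames, num_subclips=2, subclip_len=8, max_gap=8):
    """Toy sketch (not the paper's implementation): sample `num_subclips`
    consecutive sub-clips from a video of `num_frames` frames, skipping a
    random number of frames (the incoherence) between successive sub-clips.
    Returns the sampled frame indices and, for each incoherence, a
    (location within the assembled clip, incoherence length) label that a
    network could be trained to predict."""
    gaps = [random.randint(1, max_gap) for _ in range(num_subclips - 1)]
    total = num_subclips * subclip_len + sum(gaps)
    assert total <= num_frames, "video too short for requested sampling"
    start = random.randint(0, num_frames - total)

    indices, labels = [], []
    pos = start
    for i in range(num_subclips):
        indices.extend(range(pos, pos + subclip_len))
        pos += subclip_len
        if i < num_subclips - 1:
            labels.append((len(indices), gaps[i]))  # where and how long
            pos += gaps[i]
    return indices, labels
```

In an actual pipeline, the returned indices would be used to gather frames, and the labels would supervise the location/length prediction heads of the pretext task.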