Recognizing the transformations applied to a video clip (RecogTrans) is a long-established paradigm for self-supervised video representation learning, yet recent works report that it performs far worse than instance discrimination (InstDisc) approaches. However, based on a thorough comparison of representative RecogTrans and InstDisc methods, we observe great potential for RecogTrans on both semantic-related and temporal-related downstream tasks. Built on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals during pre-training. To mitigate this problem, we propose TransRank, a unified framework for recognizing Transformations in a Ranking formulation. TransRank provides accurate supervision signals by recognizing transformations relatively, and it consistently outperforms the classification-based formulation. Moreover, the unified framework can be instantiated with an arbitrary set of temporal or spatial transformations, demonstrating good generality. With the ranking-based formulation and several empirical practices, we achieve competitive performance on video retrieval and action recognition. Under the same setting, TransRank surpasses the previous state-of-the-art method by 6.4% on UCF101 and 8.3% on HMDB51 for action recognition (Top-1 Acc), and improves video retrieval on UCF101 by 20.4% (R@1). These promising results validate that RecogTrans is still a paradigm worth exploring for video self-supervised learning. Code will be released at https://github.com/kennymckormick/TransRank.
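To make the ranking formulation concrete, below is a minimal PyTorch sketch (not the authors' released code) of the core idea: each clip is transformed by K candidate transformations (e.g., K playback speeds), and instead of hard-label classification, the model's score for transformation k on the k-th transformed view of a clip must exceed, by a margin, its score for transformation k on the other views of the same clip. Comparing views of the same clip removes clip-dependent bias (some clips inherently look "fast" or "slow"). The class name, tensor layout, and margin value are illustrative assumptions.

```python
# Minimal sketch of a ranking-based transformation-recognition loss,
# assuming scores[b, i, j] is the model's logit for "transformation j
# was applied" evaluated on the i-th transformed view of clip b.
import torch
import torch.nn as nn

class RankingTransLoss(nn.Module):
    def __init__(self, margin: float = 1.0):
        super().__init__()
        self.margin = margin

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (B, K, K); diagonal entries are the matched
        # (view, transformation) pairs that should rank highest.
        B, K, _ = scores.shape
        pos = scores.diagonal(dim1=1, dim2=2)  # (B, K)
        loss, count = scores.new_zeros(()), 0
        for j in range(K):
            for i in range(K):
                if i == j:
                    continue
                # View i was NOT produced by transformation j, so its
                # score for j should trail the matched view's score by
                # at least `margin` (hinge / margin ranking loss).
                gap = self.margin - (pos[:, j] - scores[:, i, j])
                loss = loss + torch.clamp(gap, min=0).mean()
                count += 1
        return loss / count
```

Under this sketch, supervision is relative within each clip, which is how a ranking objective can sidestep the noisy hard labels of classification-based RecogTrans.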