This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations. In TeG, we sample a long clip from a video and a short clip that lies inside the long clip. We then extract their dense temporal embeddings. The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings in the short clip and the long clip, and a persistent temporal learning objective to pull together global embeddings of the two clips. Our study reveals the impact of temporal granularity with three major findings. 1) Different video tasks may require features of different temporal granularities. 2) Intriguingly, some tasks that are widely considered to require temporal awareness can actually be well addressed by temporally persistent features. 3) The flexibility of TeG gives rise to state-of-the-art results on 8 video benchmarks, outperforming supervised pre-training in most cases.
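The two training objectives described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's actual implementation: real TeG uses learned video encoders and contrastive losses over batches, whereas here the function names (`teg_losses`, `mean_pool`), the simple `1 - cosine` similarity penalty, and the loss weights are all assumptions made for clarity.

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pool(emb):
    # collapse a list of T per-timestep vectors into one global vector
    t = len(emb)
    return [sum(vec[i] for vec in emb) / t for i in range(len(emb[0]))]

def teg_losses(long_emb, short_emb, offset, w_fine=1.0, w_persist=1.0):
    """Toy version of the two TeG objectives.

    long_emb:  dense temporal embeddings of the long clip (T_long x D)
    short_emb: dense temporal embeddings of the short clip (T_short x D)
    offset:    start index of the short clip inside the long clip
    """
    # fine-grained temporal objective: each short-clip timestep should
    # match its temporally corresponding embedding in the long clip
    fine = sum(
        1.0 - cosine(s, long_emb[offset + t])
        for t, s in enumerate(short_emb)
    ) / len(short_emb)
    # persistent temporal objective: pull together the global
    # (mean-pooled) embeddings of the two clips
    persist = 1.0 - cosine(mean_pool(long_emb), mean_pool(short_emb))
    return w_fine * fine + w_persist * persist
```

With perfectly aligned toy embeddings the combined loss is zero, and any temporal misalignment of the short clip makes it positive, which mirrors how the two objectives reward both local correspondence and global agreement.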