Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and yield unsatisfactory out-of-the-box representations. We argue that this is because they capture only pixel-level knowledge rather than spatiotemporal commonsense, which is far from cognition-level video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step toward exploiting language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e., we leverage the naturally transcribed speech to provide noisy but useful semantics over time. Furthermore, rather than the simple concept learning of vision-caption contrast, we encourage cognition-level temporal commonsense reasoning via narrative reorganization. These advantages enable our model to contextualize what is happening as humans do and to apply seamlessly to large-scale uncurated video data in the real world. Note that our method differs from those designed for video-text alignment (e.g., Frozen) and multimodal representation learning (e.g., Merlot). Our method demonstrates strong out-of-the-box spatiotemporal representations on diverse video benchmarks, e.g., a +13.6% gain over VideoMAE on SSV2 via linear probing.
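
To make the transcript-sorting pretext task more concrete, the following is a minimal sketch of how a training sample and objective might be set up. It assumes the sorting is cast as classifying each shuffled segment into its original chronological position; the abstract does not specify this formulation, so treat it as an illustration only. The function names (`make_sorting_sample`, `sorting_loss`, `restore_order`) are hypothetical, and `logits` stands in for the scores a model would produce after the shuffled transcripts attend to the video representations.

```python
import math
import random


def make_sorting_sample(transcripts, rng):
    """Shuffle consecutive ASR segments from one clip.

    Returns (shuffled, targets), where targets[i] is the original
    chronological index of shuffled[i].
    """
    order = list(range(len(transcripts)))
    rng.shuffle(order)
    shuffled = [transcripts[j] for j in order]
    return shuffled, order


def sorting_loss(logits, targets):
    """Mean cross-entropy over position classification (an assumed objective).

    logits[i][k] is a hypothetical score that shuffled segment i belongs
    at chronological position k.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
    return total / len(targets)


def restore_order(shuffled, positions):
    """Reorder shuffled segments by their (predicted) positions."""
    pairs = sorted(zip(positions, shuffled), key=lambda p: p[0])
    return [segment for _, segment in pairs]


if __name__ == "__main__":
    rng = random.Random(0)
    segments = ["crack the eggs", "whisk them", "heat the pan", "pour and stir"]
    shuffled, targets = make_sorting_sample(segments, rng)
    print(shuffled, targets)
    print(restore_order(shuffled, targets) == segments)
```

Because the supervision signal is just the original order of the transcripts, a setup like this needs no manual labels or descriptive captions, which is consistent with the claim that the approach applies to uncurated video.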
