We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest-neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches. Code is available at https://github.com/pytorch/fairseq/examples/MMPT.
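To make the objective concrete, the sketch below shows a symmetric InfoNCE-style contrastive loss of the kind the abstract describes: video and text clips that temporally overlap form positive pairs, while the remaining pairs in the batch (which VideoCLIP further augments with retrieval-based hard negatives) serve as negatives. This is a minimal illustration, not the authors' implementation; the function name, tensor shapes, and temperature value are assumptions for exposition.

```python
# Minimal sketch of a symmetric video-text contrastive loss.
# Assumptions: video_emb and text_emb are (batch, dim) pooled embeddings,
# where row i of each tensor comes from a temporally overlapping pair.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products are cosine similarities.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds positive pairs,
    # off-diagonal entries act as in-batch negatives.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over both retrieval directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```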