Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill this gap, we present a general video foundation model, InternVideo, that takes advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates the video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets spanning a broad range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our method obtains 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results demonstrate the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .
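To make the "learnable coordination" of the two pretraining streams concrete, below is a minimal sketch, not the actual InternVideo module: it assumes a masked-video-modeling branch and a video-language contrastive branch each produce a clip-level feature, and combines them with a learnable gate. All names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossRepresentationFusion(nn.Module):
    """Hypothetical sketch of coordinating generative (masked modeling) and
    discriminative (contrastive) video features; the real InternVideo
    coordination module may differ."""

    def __init__(self, dim_mvm: int, dim_clip: int, dim_out: int):
        super().__init__()
        # Project both representations into a shared space.
        self.proj_mvm = nn.Linear(dim_mvm, dim_out)
        self.proj_clip = nn.Linear(dim_clip, dim_out)
        # Learnable scalar gate balancing the two streams (assumption).
        self.alpha = nn.Parameter(torch.tensor(0.0))

    def forward(self, feat_mvm: torch.Tensor, feat_clip: torch.Tensor) -> torch.Tensor:
        # feat_mvm:  [B, dim_mvm]  from the masked-video-modeling branch
        # feat_clip: [B, dim_clip] from the video-language contrastive branch
        a = torch.sigmoid(self.alpha)
        return a * self.proj_mvm(feat_mvm) + (1 - a) * self.proj_clip(feat_clip)

# Usage with hypothetical feature dimensions:
fusion = CrossRepresentationFusion(dim_mvm=768, dim_clip=1024, dim_out=768)
fused = fusion(torch.randn(4, 768), torch.randn(4, 1024))
print(fused.shape)  # torch.Size([4, 768])
```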