Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill this gap, we present a general video foundation model, InternVideo, that takes advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates the video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets spanning a broad range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our method obtains 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results demonstrate the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .
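To make the "learnable coordination" of the two pretraining streams concrete, below is a minimal sketch, not the actual InternVideo module: it assumes a masked-video-modeling branch and a video-language contrastive branch each produce a clip-level feature, and combines them with a learnable gate. All names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossRepresentationFusion(nn.Module):
    """Hypothetical sketch of coordinating generative (masked modeling) and
    discriminative (contrastive) video features; the real InternVideo
    coordination module may differ."""

    def __init__(self, dim_mvm: int, dim_clip: int, dim_out: int):
        super().__init__()
        # Project both representations into a shared space.
        self.proj_mvm = nn.Linear(dim_mvm, dim_out)
        self.proj_clip = nn.Linear(dim_clip, dim_out)
        # Learnable scalar gate balancing the two streams (assumption).
        self.alpha = nn.Parameter(torch.tensor(0.0))

    def forward(self, feat_mvm: torch.Tensor, feat_clip: torch.Tensor) -> torch.Tensor:
        # feat_mvm:  [B, dim_mvm]  from the masked-video-modeling branch
        # feat_clip: [B, dim_clip] from the video-language contrastive branch
        a = torch.sigmoid(self.alpha)
        return a * self.proj_mvm(feat_mvm) + (1 - a) * self.proj_clip(feat_clip)

# Usage with hypothetical feature dimensions:
fusion = CrossRepresentationFusion(dim_mvm=768, dim_clip=1024, dim_out=768)
fused = fusion(torch.randn(4, 768), torch.randn(4, 1024))
print(fused.shape)  # torch.Size([4, 768])
```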