Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust ViT from limited data, its low-level reconstruction leads to convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods. To increase data efficiency, we mask out most of the low-semantics video tokens, but selectively align the unmasked tokens with the IFM, which serves as the UnMasked Teacher (UMT). By providing semantic guidance, our method enables faster convergence and multimodal friendliness. With a progressive pre-training framework, our model can handle various tasks, including scene-related, temporal-related, and complex video-language understanding. Using only public sources for pre-training in 6 days on 32 A100 GPUs, our scratch-built ViT-L/16 achieves state-of-the-art performance on various video tasks. The code and models will be released at https://github.com/OpenGVLab/unmasked_teacher.
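To make the masked-alignment objective concrete, the sketch below illustrates the general idea: most video tokens are dropped, and only the kept (unmasked) tokens are aligned with a frozen image foundation model acting as the teacher. This is a minimal sketch under stated assumptions, not the released implementation: the function names, the norm-based token scoring, and the MSE objective are placeholders for how low-semantics tokens are masked and how the alignment loss is computed.

```python
# Minimal PyTorch sketch of aligning unmasked tokens with a frozen teacher.
# `student` and `teacher` are assumed to be token-level encoders with matching
# output dimensions; the scoring rule and loss are illustrative assumptions.
import torch
import torch.nn.functional as F


def unmasked_teacher_loss(student, teacher, video_tokens, keep_ratio=0.2):
    """Align the few kept ("unmasked") tokens with a frozen image teacher."""
    B, N, D = video_tokens.shape
    num_keep = max(1, int(N * keep_ratio))

    with torch.no_grad():
        teacher_feats = teacher(video_tokens)            # (B, N, C), frozen IFM
        # Hypothetical semantic score: keep tokens with the largest teacher features.
        scores = teacher_feats.norm(dim=-1)              # (B, N)
        keep_idx = scores.topk(num_keep, dim=1).indices  # (B, num_keep)

    idx = keep_idx.unsqueeze(-1)
    kept_tokens = torch.gather(video_tokens, 1, idx.expand(-1, -1, D))
    targets = torch.gather(teacher_feats, 1, idx.expand(-1, -1, teacher_feats.size(-1)))

    # The student only processes the unmasked tokens, which is where the
    # data and compute efficiency comes from.
    student_feats = student(kept_tokens)                 # (B, num_keep, C)

    # Regress normalized student features toward the teacher's features.
    return F.mse_loss(F.normalize(student_feats, dim=-1),
                      F.normalize(targets, dim=-1))
```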