Scale is the primary factor for building a powerful foundation model that can generalize well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is already efficient due to the high masking ratio in the encoder, masking the decoder further reduces the overall computational cost. This enables efficient pre-training of billion-parameter models on video. We also adopt a progressive training paradigm: an initial pre-training on a diverse, multi-sourced unlabeled dataset, followed by post-pre-training on a mixed labeled dataset. With these designs, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners.
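To make the dual masking idea concrete, below is a minimal PyTorch-style sketch, not the paper's implementation: `dual_masked_forward` is a hypothetical helper, the `encoder` and `decoder` callables (and the decoder's signature) are assumptions, and plain random masks stand in for whatever masking patterns the paper actually uses. The key point it illustrates is that the encoder sees only a small visible subset of tokens, and the decoder reconstructs only a subset of the masked tokens, further shrinking the token count it must process.

```python
# Minimal sketch of dual masking, assuming PyTorch. Random masks are used
# here for simplicity; the paper's actual masking patterns may differ.
import torch

def dual_masked_forward(tokens, encoder, decoder,
                        encoder_mask_ratio=0.9, decoder_mask_ratio=0.5):
    """tokens: (B, N, C) video patch embeddings."""
    B, N, C = tokens.shape

    # Encoder mask: keep only a small visible subset (e.g., 10% of tokens
    # at a 90% masking ratio), so the encoder runs on few tokens.
    num_visible = int(N * (1 - encoder_mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)          # random permutation per sample
    ids_visible = ids_shuffle[:, :num_visible]
    visible = torch.gather(
        tokens, 1, ids_visible.unsqueeze(-1).expand(-1, -1, C))

    latent = encoder(visible)

    # Decoder mask: reconstruct only a subset of the masked tokens,
    # which cuts the decoder's token count and the overall FLOPs.
    ids_masked = ids_shuffle[:, num_visible:]
    num_recon = int(ids_masked.shape[1] * (1 - decoder_mask_ratio))
    ids_recon = ids_masked[:, :num_recon]

    # Hypothetical decoder interface: predict targets at the chosen ids.
    pred = decoder(latent, ids_recon)
    return pred, ids_recon
```

Under this sketch, with a 90% encoder mask and a 50% decoder mask, the encoder processes 10% of the tokens and the decoder supervises only 45% of them, which is how masking the decoder can further reduce cost even on top of an already sparse encoder.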