OmniMAE: Single Model Masked Pretraining on Images and Videos

Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work has studied these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. In particular, our single pretrained model can be finetuned to achieve 86.5% on ImageNet and 75.3% on the challenging Something Something-v2 video benchmark. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training.
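
The patch-dropping step mentioned in the abstract can be illustrated with a minimal sketch. It assumes uniformly random masking, a 224x224 image split into 16x16 patches (196 tokens), and a 16-frame clip split into 2x16x16 spatio-temporal patches (1568 tokens). None of these sizes, nor the helper name `random_patch_mask`, come from the abstract; they are illustrative assumptions only. The sketch shows just how few tokens an encoder would see at 90% and 95% masking, which is where the training speedup would come from.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio, rng):
    """Split patch indices into a visible set and a dropped set.

    Hypothetical helper: uniformly random masking is an assumption here,
    not a detail stated in the abstract.
    """
    # Keep at least one patch so the encoder input is never empty.
    num_keep = max(1, int(round(num_patches * (1.0 - mask_ratio))))
    perm = rng.permutation(num_patches)
    keep = np.sort(perm[:num_keep])   # indices fed to the encoder
    drop = np.sort(perm[num_keep:])   # indices the decoder must reconstruct
    return keep, drop


rng = np.random.default_rng(0)

# Assumed token counts and embedding width (illustrative, not from the source).
NUM_IMAGE_PATCHES = 196    # 14 x 14 patches of a 224 x 224 image
NUM_VIDEO_PATCHES = 1568   # 8 x 14 x 14 patches of a 16-frame clip
EMBED_DIM = 768

image_tokens = rng.standard_normal((NUM_IMAGE_PATCHES, EMBED_DIM))
video_tokens = rng.standard_normal((NUM_VIDEO_PATCHES, EMBED_DIM))

# Mask ratios quoted in the abstract: 90% for images, 95% for videos.
img_keep, img_drop = random_patch_mask(NUM_IMAGE_PATCHES, 0.90, rng)
vid_keep, vid_drop = random_patch_mask(NUM_VIDEO_PATCHES, 0.95, rng)

# Only the visible tokens would be passed to the encoder.
visible_image = image_tokens[img_keep]
visible_video = video_tokens[vid_keep]

print(visible_image.shape, visible_video.shape)
```

Under these assumed sizes, the encoder would process only about a tenth of the image tokens and about a twentieth of the video tokens per sample, while the dropped indices mark the patches to be reconstructed.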

