Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work has studied these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single-modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. In particular, our single pretrained model can be finetuned to achieve 86.5% on ImageNet and 75.3% on the challenging Something Something-v2 video benchmark. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training.
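As a minimal sketch (not the authors' released code), the masking step described above can be implemented by randomly shuffling patch tokens and keeping only a small visible subset, so the encoder never processes the dropped 90% of image patches or 95% of video patches. Patch counts and dimensions below are illustrative assumptions.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float):
    """tokens: (batch, num_patches, dim). Returns visible tokens and a drop mask."""
    B, N, D = tokens.shape
    num_keep = max(1, int(N * (1.0 - mask_ratio)))

    # Assign a random score per patch; keep the lowest-scoring patches visible.
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_keep = ids_shuffle[:, :num_keep]

    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask over all patches: 1 = dropped (to be reconstructed), 0 = kept.
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask

# Masking ratios quoted in the abstract, applied to hypothetical token grids.
image_tokens = torch.randn(2, 196, 768)        # 14x14 image patches
video_tokens = torch.randn(2, 8 * 196, 768)    # 8 temporal x 14x14 video patches
img_visible, _ = random_masking(image_tokens, mask_ratio=0.90)
vid_visible, _ = random_masking(video_tokens, mask_ratio=0.95)
print(img_visible.shape, vid_visible.shape)    # (2, 19, 768) and (2, 78, 768)
```

Because only the visible subset is fed to the encoder, the cost of a training step shrinks roughly in proportion to the masking ratio, which is what enables the fast pretraining claimed above.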