Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs at par with or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.
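The core idea above — one set of parameters shared across images, videos, and single-view 3D inputs — can be illustrated with a toy sketch: each modality is converted into a sequence of patch tokens, and the very same projection weights embed all of them. This is only a minimal numpy illustration of the principle, not the paper's implementation; the patch size, embedding dimension, and the handling of the depth channel here are hypothetical simplifications (Omnivore itself uses a transformer trunk, not a single linear map).

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, patch = 8, 4

# One shared set of weights used for EVERY modality — a toy stand-in
# for the shared transformer trunk (hypothetical, for illustration only).
W = rng.standard_normal((patch * patch * 3, embed_dim))

def patchify(img):
    """Split an H x W x C array into non-overlapping patch vectors."""
    H, W_, C = img.shape
    return (img.reshape(H // patch, patch, W_ // patch, patch, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch * patch * C))

def embed(patches):
    # Same parameters regardless of which modality the patches came from.
    return patches @ W

image = rng.standard_normal((16, 16, 3))     # a single RGB image
video = rng.standard_normal((2, 16, 16, 3))  # a 2-frame video clip
rgbd  = rng.standard_normal((16, 16, 4))     # RGB + depth channel

img_tokens = embed(patchify(image))
# A video is just more patches: one set of spatial patches per frame.
vid_tokens = np.concatenate([embed(patchify(f)) for f in video])
# Toy choice: fold RGB-D down to its RGB channels before patchifying.
rgbd_tokens = embed(patchify(rgbd[..., :3]))

print(img_tokens.shape, vid_tokens.shape, rgbd_tokens.shape)
```

All three modalities end up as token sequences with the same embedding dimension, so a single downstream classifier can consume any of them — which is what lets one model, with one set of parameters, be trained jointly on all three tasks.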