We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures on the downstream tasks. In particular, VATT's vision Transformer achieves a top-1 accuracy of 82.1% on Kinetics-400, 83.6% on Kinetics-600, and 41.1% on Moments in Time, setting new records while avoiding supervised pre-training. Transferring to image classification leads to 78.7% top-1 accuracy on ImageNet, compared to 64.7% when training the same Transformer from scratch, showing the generalizability of our model despite the domain gap between videos and images. VATT's audio Transformer also sets a new record on waveform-based audio event recognition by achieving an mAP of 39.4% on AudioSet without any supervised pre-training.
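To make the multimodal contrastive objective mentioned above more concrete, the following is a minimal sketch of a pairwise InfoNCE-style loss between two batches of paired modality embeddings (e.g., video and audio). The function name `infonce_loss`, the temperature value, and the symmetric two-direction formulation are illustrative assumptions, not necessarily the exact objective used by VATT.

```python
import torch
import torch.nn.functional as F

def infonce_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings.

    z_a, z_b: [batch, dim] projections of two modalities (e.g., video and
    audio) from the same clips; off-diagonal pairs act as negatives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                # [batch, batch] similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Cross-entropy in both matching directions (a -> b and b -> a), averaged.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical usage: combine pairwise losses over the three modalities,
# e.g. video-audio and video-text, as in a multimodal contrastive setup.
# loss = infonce_loss(video_emb, audio_emb) + infonce_loss(video_emb, text_emb)
```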