Multimodal Masked Autoencoders Learn Transferable Representations

Building scalable models to learn from diverse, multimodal data remains an open challenge. For vision-language data, the dominant approaches are based on contrastive learning objectives that train a separate encoder for each modality. While effective, contrastive learning approaches introduce sampling bias depending on the data augmentations used, which can degrade performance on downstream tasks. Moreover, these methods are limited to paired image-text data and cannot leverage widely available unpaired data. In this paper, we investigate whether a large multimodal model trained purely via masked token prediction, without using modality-specific encoders or contrastive learning, can learn transferable representations for downstream tasks. We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks. Surprisingly, we find that M3AE benefits from a higher text mask ratio (50-90%), in contrast to BERT, whose standard masking ratio is 15%, due to the joint training of two data modalities. We also provide qualitative analysis showing that the learned representation incorporates meaningful information from both image and language. Lastly, we demonstrate the scalability of M3AE with larger model size and training time, and its flexibility to train on both paired image-text data and unpaired data.
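
The masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The token counts, embedding size, image mask ratio of 0.75, and the helper name `random_mask` are all assumptions chosen for illustration; only the text mask ratio is taken from the 50-90% range reported in the abstract. The sketch also assumes an MAE-style setup in which the encoder sees only the unmasked tokens, which the abstract does not state explicitly.

```python
import numpy as np


def random_mask(num_tokens, mask_ratio, rng):
    """Return a boolean array of length num_tokens; True marks a masked token."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask


rng = np.random.default_rng(0)

# Hypothetical sizes: 196 image patches, 32 text tokens, 64-dim embeddings.
num_patches, num_text, dim = 196, 32, 64
image_tokens = rng.normal(size=(num_patches, dim))
text_tokens = rng.normal(size=(num_text, dim))

# Assumed ratios: 0.75 for image patches (an illustrative choice) and
# 0.75 for text, which lies inside the 50-90% range from the abstract.
image_mask = random_mask(num_patches, 0.75, rng)
text_mask = random_mask(num_text, 0.75, rng)

# A single sequence for a single encoder: image and text tokens are
# concatenated rather than routed to modality-specific encoders.
tokens = np.concatenate([image_tokens, text_tokens], axis=0)
mask = np.concatenate([image_mask, text_mask])

# Assumption: the encoder receives only the visible (unmasked) tokens,
# and a decoder would be trained to predict the masked ones.
visible_tokens = tokens[~mask]
print(visible_tokens.shape, int(mask.sum()))
```

For comparison, applying BERT's standard 15% ratio to the same 32 text tokens would mask only about 5 of them, whereas the 75% ratio used here masks 24.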
