We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks while being data, memory and run-time efficient. We make three key contributions. First, in contrast to most existing works, we use a single transformer with all the encoder layers processing both the text and the image modalities. Second, we propose a stage-wise training strategy where the model is first trained on images, then jointly on unimodal text and image datasets, and finally jointly on text and text-image datasets. Third, to preserve information across both modalities, we propose a training pipeline that learns simultaneously from gradient updates of different modalities at each training update step. The results on downstream text-only, image-only and multimodal tasks show that our model is competitive with several strong models while using fewer parameters and less pre-training data. For example, MoMo performs competitively with FLAVA on multimodal (+3.1), image-only (+1.1) and text-only (-0.1) tasks despite having 2/5th the number of parameters and using 1/3rd of the image-text training pairs. Finally, we ablate various design choices and further show that increasing model size yields significant performance gains, indicating potential for substantial improvements with larger models using our approach.
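The third contribution, in which each update step learns from gradients of several modalities at once, can be pictured with a minimal PyTorch sketch. This is an illustrative assumption, not the authors' implementation: SharedEncoder, modality_loss and the toy batches are hypothetical stand-ins; the only point shown is that per-modality losses are backpropagated into the same shared weights before a single optimizer step.

```python
# Minimal sketch (PyTorch) of one multi-modality update step.
# SharedEncoder and modality_loss are hypothetical stand-ins, not the paper's code.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Toy stand-in: one encoder body shared by every modality."""
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.body(x)

def modality_loss(encoder, batch):
    # Placeholder self-supervised objective (e.g. masked modelling / contrastive).
    return encoder(batch).pow(2).mean()

encoder = SharedEncoder()
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

image_batch = torch.randn(8, 64)       # stand-ins for real per-modality batches
text_batch = torch.randn(8, 64)
image_text_batch = torch.randn(8, 64)

optimizer.zero_grad()
for batch in (image_batch, text_batch, image_text_batch):
    modality_loss(encoder, batch).backward()   # gradients accumulate in the shared weights
optimizer.step()                               # one update sees all modalities
```

Under this reading, no modality overwrites another between updates: every parameter change reflects a sum of per-modality gradients, which is one way to preserve information across modalities in a fully shared encoder.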