Recently, multimodal transformer models have gained popularity because their performance on language and vision tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors that can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.