The limited availability of clean and diverse labeled data is a major roadblock for training models on complex tasks such as visual question answering (VQA). The extensive work on large vision-and-language models has shown that self-supervised learning is effective for pretraining multimodal interactions. In this technical report, we focus on visual representations. We review and evaluate self-supervised methods to leverage unlabeled images and pretrain a model, which we then fine-tune on a custom VQA task that allows controlled evaluation and diagnosis. We compare energy-based models (EBMs) with contrastive learning (CL). While EBMs are growing in popularity, they have so far lacked evaluation on downstream tasks. We find that both EBMs and CL can learn representations from unlabeled images that enable training a VQA model with very little annotated data. In a simple setting similar to CLEVR, we find that CL representations also improve systematic generalization, and even match the performance of representations from a larger, supervised, ImageNet-pretrained model. However, we find EBMs difficult to train because of instabilities and high variability in their results. Although EBMs prove useful for out-of-distribution (OOD) detection, other results on supervised energy-based training and uncertainty calibration are largely negative. Overall, CL currently appears to be the preferable option over EBMs.