Despite the superior performance brought by vision-and-language pretraining, it remains unclear whether learning with multi-modal data helps models understand each individual modality. In this work, we investigate how language can help with visual representation learning from a probing perspective. Specifically, we compare vision-and-language and vision-only models by probing their visual representations on a broad range of tasks, in order to assess the quality of the learned representations in a fine-grained manner. Interestingly, our probing results suggest that vision-and-language models are better at label prediction tasks like object and attribute prediction, while vision-only models are stronger at dense prediction tasks that require more localized information. With further analysis using detailed metrics, our study suggests that language helps vision models learn better semantics, but not localization. Code is released at https://github.com/Lizw14/visual_probing.
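To make the probing setup concrete, below is a minimal sketch of linear probing on frozen visual features. It is not the released implementation: the torchvision ResNet-50 backbone, the 2048-dimensional feature size, the `num_classes` value, and the random tensors standing in for probe data are all illustrative assumptions; in the actual study the backbone comes from a vision-and-language or vision-only pretrained model and the labels come from the probing tasks (e.g., object or attribute prediction).

```python
# Minimal linear-probing sketch (illustrative, not the paper's exact code).
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Frozen backbone: features are extracted with no gradient updates.
backbone = resnet50(weights=None)      # stand-in for a pretrained vision encoder
backbone.fc = nn.Identity()            # expose the 2048-d pooled features
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Linear probe: a single classification layer trained on top of frozen features,
# e.g., for an object- or attribute-prediction probing task.
num_classes = 80                       # hypothetical label-set size
probe = nn.Linear(2048, num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One toy training step on random data; real probing uses task images and labels.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
with torch.no_grad():
    feats = backbone(images)           # (8, 2048) frozen features
logits = probe(feats)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```

Because only the probe is trained, its accuracy reflects how much task-relevant information is already encoded in the frozen representation, which is what allows a fine-grained comparison between vision-and-language and vision-only backbones.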