There are limitations to learning language from text alone, so recent work has focused on developing multimodal models. However, few benchmarks exist that can measure what language models learn about language from multimodal training. We hypothesize that training on a visual modality should improve the visual commonsense knowledge in language models. We therefore introduce two evaluation tasks for measuring visual commonsense knowledge in language models and use them to evaluate different multimodal models and unimodal baselines. Primarily, we find that visual commonsense knowledge does not differ significantly between the multimodal models and unimodal baseline models trained on visual text data.