Recent work has raised concerns about the inherent limitations of text-only pretraining. In this paper, we first demonstrate that reporting bias, the tendency of people to not state the obvious, is one of the causes of this limitation, and then investigate to what extent multimodal training can mitigate this issue. To accomplish this, we 1) generate the Color Dataset (CoDa), a dataset of human-perceived color distributions for 521 common objects; 2) use CoDa to analyze and compare the color distribution found in text, the distribution captured by language models, and a human's perception of color; and 3) investigate the performance differences between text-only and multimodal models on CoDa. Our results show that the distribution of colors that a language model recovers correlates more strongly with the inaccurate distribution found in text than with the ground-truth, supporting the claim that reporting bias negatively impacts and inherently limits text-only training. We then demonstrate that multimodal models can leverage their visual training to mitigate these effects, providing a promising avenue for future research.
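To make the comparison in step 2) concrete, below is a minimal sketch (not the authors' released code) of how one might correlate a language model's predicted color distribution for an object with a reference distribution, whether text-derived or human-perceived. The 11 basic color terms and the example probabilities for "banana" are illustrative assumptions, and Spearman correlation is one plausible choice of metric.

```python
# Minimal sketch: correlate a model's color distribution with a reference one.
# Color inventory and example numbers are hypothetical, for illustration only.
from scipy.stats import spearmanr

COLORS = ["black", "blue", "brown", "gray", "green", "orange",
          "pink", "purple", "red", "white", "yellow"]

def correlate(model_dist, reference_dist):
    """Spearman correlation between two color distributions over COLORS."""
    model_probs = [model_dist.get(c, 0.0) for c in COLORS]
    ref_probs = [reference_dist.get(c, 0.0) for c in COLORS]
    rho, _ = spearmanr(model_probs, ref_probs)
    return rho

# Hypothetical distributions for "banana": the model over-weights colors that
# are frequently *mentioned* in text (reporting bias), while the
# human-perceived distribution is dominated by yellow.
model_dist = {"yellow": 0.45, "green": 0.30, "brown": 0.15, "black": 0.10}
human_dist = {"yellow": 0.80, "green": 0.15, "brown": 0.05}

print(f"Spearman rho vs. human perception: {correlate(model_dist, human_dist):.2f}")
```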