Large language models are known to suffer from the hallucination problem: they are prone to outputting statements that are false or inconsistent, indicating a lack of knowledge. A proposed solution to this is to provide the model with additional data modalities that complement the knowledge obtained through text. We investigate the use of visual data to complement the knowledge of large language models by proposing a method for evaluating visual knowledge transfer to text for uni- or multimodal language models. The method is based on two steps: 1) a novel task querying for knowledge of memory colors, i.e., the typical colors of well-known objects, and 2) filtering of the models' training data to clearly separate knowledge contributions. Additionally, we introduce a model architecture that involves a visual imagination step and evaluate it with our proposed method. We find that our method can successfully be used to measure visual knowledge transfer capabilities in models and that our novel model architecture shows promising results for leveraging multimodal knowledge in a unimodal setting.
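To make the memory-color querying step concrete, the sketch below shows one way such a probe could be posed to a masked language model as a cloze task. This is a minimal illustration, not the paper's exact setup: the prompt template, object list, and color vocabulary are illustrative assumptions.

```python
# Minimal sketch of a cloze-style memory-color probe for a masked language model.
# Assumptions: prompt template, objects, and color set are placeholders for illustration.
from transformers import pipeline

COLORS = ["red", "orange", "yellow", "green", "blue",
          "purple", "brown", "black", "white", "pink", "gray"]
# Hypothetical object -> expected memory color pairs.
OBJECTS = {"banana": "yellow", "grass": "green", "snow": "white"}

fill = pipeline("fill-mask", model="bert-base-uncased")

correct = 0
for obj, expected in OBJECTS.items():
    prompt = f"The color of {obj} is [MASK]."
    # Restrict predictions to the color vocabulary and take the top-scoring candidate.
    predictions = fill(prompt, targets=COLORS)
    predicted = predictions[0]["token_str"].strip()
    correct += int(predicted == expected)

print(f"Memory-color accuracy: {correct / len(OBJECTS):.2f}")
```

Restricting the prediction to a fixed color vocabulary keeps the task a forced-choice query, so accuracy reflects whether the model's textual knowledge encodes the object's typical color rather than its general fluency.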