Vision-language models can assess visual context in an image and generate descriptive text. While the generated text may be accurate and syntactically correct, it is often overly general. To address this, recent work has used optical character recognition to supplement visual information with text extracted from the image. In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image but is not used by current models. We modify previous multimodal frameworks to accept relevant information from any number of auxiliary classifiers. In particular, we focus on person names as an additional set of tokens and create a novel image-caption dataset to facilitate captioning with person names. The dataset, Politicians and Athletes in Captions (PAC), consists of captioned images of well-known people in context. By fine-tuning pretrained models on this dataset, we demonstrate a model that can naturally integrate facial recognition tokens into generated text while training on limited data. For the PAC dataset, we describe the collection process and report baseline benchmark scores.
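The abstract does not specify how auxiliary-classifier outputs are fed to the captioning model. As a purely illustrative sketch, the Python below shows one plausible way outputs from any number of auxiliary classifiers (e.g., OCR text and facial-recognition names) could be flattened into an extra token sequence for a caption generator. Every identifier here (AuxiliaryDetection, build_input_tokens, the source-marker tokens) is a hypothetical assumption, not the authors' API.

```python
# Hypothetical sketch: merging tokens from multiple auxiliary classifiers
# (e.g., OCR words, facial-recognition person names) into one input
# sequence for a captioning model. Not the paper's implementation.

from dataclasses import dataclass

@dataclass
class AuxiliaryDetection:
    source: str   # classifier that produced the token, e.g. "ocr" or "face"
    token: str    # extracted text, e.g. a scene word or a person name
    score: float  # classifier confidence in [0, 1]

def build_input_tokens(detections, min_score=0.5):
    """Flatten detections from any number of auxiliary classifiers into a
    single token sequence, tagging each detection with a source marker so
    the model can learn source-specific copy behavior."""
    tokens = []
    for det in detections:
        if det.score < min_score:
            continue  # drop low-confidence detections
        tokens.append(f"<{det.source}>")  # assumed special marker token
        tokens.extend(det.token.split())
    return tokens

if __name__ == "__main__":
    detections = [
        AuxiliaryDetection("ocr", "FINISH LINE", 0.91),
        AuxiliaryDetection("face", "Usain Bolt", 0.88),
        AuxiliaryDetection("face", "unknown", 0.21),  # filtered out below threshold
    ]
    print(build_input_tokens(detections))
    # ['<ocr>', 'FINISH', 'LINE', '<face>', 'Usain', 'Bolt']
```

Under this assumed design, the marker tokens let a single interface accommodate new auxiliary classifiers without architectural changes, which matches the abstract's claim of accepting information from any number of classifiers.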