People say, "A picture is worth a thousand words." How, then, can we extract that rich information from an image? We argue that by using visual clues to bridge large pretrained vision foundation models and language models, we can do so without any extra cross-modal training. Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image (e.g., image tags, object attributes/locations, captions) as a structured textual prompt, called visual clues, using a vision foundation model. Based on the visual clues, we use a large language model to produce a series of comprehensive descriptions of the visual content, which are then verified by the vision model again to select the candidate that aligns best with the image. We evaluate the quality of the generated descriptions with quantitative and qualitative measurements. The results demonstrate the effectiveness of such a structured semantic representation.
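For concreteness, the three-step pipeline described above (extract visual clues, generate candidate descriptions, verify against the image) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the three model calls (vision_model_clues, language_model_describe, vision_model_score) are hypothetical placeholders standing in for a pretrained vision foundation model and a large language model.

```python
"""Minimal sketch of the visual-clues pipeline, under assumed interfaces."""
from typing import List

def vision_model_clues(image: str) -> str:
    # Placeholder: a vision foundation model would return zero-shot
    # image tags, object attributes/locations, and a caption.
    return ("tags: dog, frisbee, park; "
            "objects: dog (brown, center), frisbee (red, upper-left); "
            "caption: a dog leaping for a frisbee")

def language_model_describe(visual_clues: str, n: int) -> List[str]:
    # Placeholder: a large language model would expand the structured
    # visual clues into n candidate open-ended descriptions.
    return [f"candidate {i} based on [{visual_clues}]" for i in range(n)]

def vision_model_score(image: str, text: str) -> float:
    # Placeholder: an image-text alignment score (e.g., a CLIP-style
    # similarity) between the image and a candidate description.
    return float(len(text) % 7)  # dummy score, for the sketch only

def describe_image(image: str, n_candidates: int = 5) -> str:
    clues = vision_model_clues(image)                          # step 1: visual clues
    candidates = language_model_describe(clues, n_candidates)  # step 2: LLM generation
    # Step 3: verify with the vision model; keep the best-aligned candidate.
    return max(candidates, key=lambda text: vision_model_score(image, text))

if __name__ == "__main__":
    print(describe_image("dog_park.jpg"))
```

The key design point is that the two foundation models exchange only text (the structured visual clues and the candidate descriptions), which is why no extra cross-modal training is needed.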