CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge

Automatically generating textual descriptions for massive unlabeled images on the web can greatly benefit realistic web applications, e.g., multimodal retrieval and recommendation. However, existing models suffer from generating "over-generic" descriptions: they tend to produce repetitive sentences built from common concepts for different images. These generic descriptions fail to provide sufficient textual semantics for ever-changing web images. Inspired by the recent success of Vision-Language Pre-training (VLP) models, which learn diverse image-text concept alignment during pre-training, we explore leveraging their cross-modal pre-trained knowledge to automatically enrich the textual semantics of image descriptions. With no need for additional human annotations, we propose a plug-and-play framework, CapEnrich, to complement generic image descriptions with more semantic details. Specifically, we first propose an automatic data-building strategy to obtain the desired training sentences, based on which we then adopt prompting strategies, i.e., learnable and template prompts, to incentivize VLP models to generate more textual details. For learnable prompts, we freeze the whole VLP model and tune only the prompt vectors, which brings two advantages: 1) the pre-trained knowledge of the VLP model is preserved as much as possible for describing diverse visual concepts; 2) only lightweight trainable parameters are required, so the approach suits low-data settings. Extensive experiments show that our method significantly improves the descriptiveness and diversity of generated sentences for web images. Our code will be released.
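The idea of freezing the model and tuning only the prompt vectors can be illustrated with a toy sketch. This is not the CapEnrich implementation: the frozen linear scorer standing in for a VLP model, the dimensions, the squared-error loss, and all names (`FROZEN_W`, `forward`, `train_prompt`) are assumptions made purely for illustration.

```python
# Toy prompt tuning: the "model" weights stay fixed, only prompt vectors are updated.
# Everything here is a hypothetical stand-in, not the paper's architecture.

DIM = 4        # embedding width (assumed)
N_PROMPT = 2   # number of learnable prompt vectors (assumed)

# Frozen "pre-trained" weights; never modified during training.
FROZEN_W = (0.5, -0.3, 0.8, 0.1)

def forward(prompt, tokens):
    """Prepend prompt vectors to the input, mean-pool, apply the frozen linear head."""
    seq = prompt + tokens
    pooled = [sum(v[d] for v in seq) / len(seq) for d in range(DIM)]
    return sum(w * p for w, p in zip(FROZEN_W, pooled))

def train_prompt(tokens, target, steps=200, lr=0.5):
    """Gradient descent on squared error, updating only the prompt vectors."""
    prompt = [[0.0] * DIM for _ in range(N_PROMPT)]
    losses = []
    n = N_PROMPT + len(tokens)
    for _ in range(steps):
        err = forward(prompt, tokens) - target
        losses.append(err * err)
        # d(loss)/d(prompt[i][d]) = 2 * err * FROZEN_W[d] / n
        for vec in prompt:
            for d in range(DIM):
                vec[d] -= lr * 2 * err * FROZEN_W[d] / n
    return prompt, losses

if __name__ == "__main__":
    tokens = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]  # made-up input embeddings
    prompt, losses = train_prompt(tokens, target=1.0)
    print("first loss:", losses[0], "last loss:", losses[-1])
```

Only `N_PROMPT * DIM` numbers are trained here, which mirrors the lightweight-parameter argument made above; in a real VLP model the frozen part would be the full network rather than a single linear head.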
