Existing image captioning systems are dedicated to generating narrative captions for images, which are presented spatially detached from the image. However, text can also be placed on the image itself as a decoration that highlights key points and increases the image's attractiveness. In this work, we introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information. To fully exploit the surrounding visual context and generate the most suitable caption for each location, we propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations, from easy to difficult. Since the model may generate redundant captions for nearby locations, we further enhance the location embedding with neighboring locations as context. For this new task, we also introduce a large-scale benchmark called CapOnImage2M, which contains 2.1 million product images, each with an average of 4.8 spatially localized captions. Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity. We will make the code and dataset public to facilitate future research.
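To make the neighbor-context idea concrete, the following is a minimal PyTorch sketch of what a neighbor-aware location embedding could look like. The class name, hidden size, and mean-pooling over the other boxes on the image are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class NeighborAwareLocationEmbedding(nn.Module):
    """Hypothetical sketch: embed a text-placement box and mix in the
    embeddings of its spatial neighbors so that nearby locations
    receive distinguishable representations (reducing redundant captions)."""

    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        # Project normalized box coordinates (x1, y1, x2, y2) to the hidden size.
        self.box_proj = nn.Linear(4, hidden_dim)
        # Fuse each location's own embedding with the mean of its neighbors.
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (num_locations, 4), coordinates normalized to [0, 1].
        own = self.box_proj(boxes)                         # (N, D)
        # Treat every other location on the same image as neighbor context;
        # a real system might restrict this to the k nearest boxes.
        mask = 1.0 - torch.eye(boxes.size(0))              # (N, N), zero diagonal
        neighbor_mean = (mask @ own) / mask.sum(dim=1, keepdim=True).clamp(min=1.0)
        return self.fuse(torch.cat([own, neighbor_mean], dim=-1))

# Usage: four candidate text locations on one product image.
boxes = torch.rand(4, 4)
emb = NeighborAwareLocationEmbedding()(boxes)              # (4, 512)
```

In a full model, embeddings like these would be added to the text and visual inputs of the multi-modal encoder so the decoder can condition each caption on its own location and its neighbors.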