Image captioning is the task of generating a textual description of an image, based primarily on the objects it contains and the actions those objects perform.
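As background for the material below, the standard formulation treats captioning as conditional language generation: a visual encoder summarises the image into features, and a language decoder emits the caption one word at a time. The sketch below is a minimal, generic illustration under assumed dimensions and a toy CNN encoder; it is not the implementation of any system discussed on this page.

```python
# Minimal sketch of an encoder-decoder image captioner (illustrative only).
# The encoder, vocabulary size and hidden sizes are assumptions for the example.
import torch
import torch.nn as nn

class SimpleCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Toy visual encoder: a small CNN that maps an image to one feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Language decoder: word embeddings + LSTM + projection onto the vocabulary.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Prepend the image feature as the first "token" of the sequence,
        # then predict each caption word from the tokens that precede it.
        img_feat = self.encoder(images).unsqueeze(1)       # (B, 1, E)
        word_embs = self.embed(captions[:, :-1])           # (B, T-1, E)
        inputs = torch.cat([img_feat, word_embs], dim=1)   # (B, T, E)
        hidden, _ = self.lstm(inputs)                      # (B, T, H)
        return self.out(hidden)                            # (B, T, V) logits

# Usage with random tensors standing in for a batch of images and captions:
logits = SimpleCaptioner()(torch.randn(2, 3, 224, 224),
                           torch.randint(0, 10000, (2, 12)))
```

Training then minimises the cross-entropy between these logits and the reference caption words; at inference the decoder generates greedily or with beam search.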


Topic: Vision and Language: the text modality in computer vision

Introduction: Document image analysis has long focused on building intelligent reading systems whose sole concern is understanding the textual and graphical information presented in image form. Computer vision as a whole, on the other hand, shows a growing tendency to exploit multimodal information in various ways; translating from one modality to another and deriving joint embeddings between modalities are two key paradigms. Text is frequently one of the modalities of interest, although it is rarely text in image form. In this tutorial, we draw on recent advances in document analysis and computer vision to show how text as a modality is currently handled in state-of-the-art research. We review a range of methods and applications, focusing on deep learning techniques for multimodal embeddings and cross-modal translation, which provide a very powerful framework for modelling the correlations between textual and visual information; a minimal joint-embedding sketch is given after the list below. Some examples of the applications covered in this tutorial include:

  • Word spotting, which aims to model the correlation between the visual (image) and textual (transcription) representations of a character string.

  • Dynamic lexicon generation, which aims to exploit the visual information of a scene to dynamically propose a dictionary of words that are highly likely to appear in the image, as a means of facilitating subsequent scene-text recognition. Self-supervised learning of visual features, where one modality (text) acts as the supervisory signal for the other (images), provides a mechanism for learning useful features while avoiding costly annotation.

  • Cross-modal / multimodal semantic retrieval of images, which aims to model the correlation between visual information and the semantics derived from textual information, enabling cross-modal image retrieval.

  • Image captioning, whose goal is to translate from the visual domain to the textual domain (natural language). An interesting variation on existing methods that we discuss in this tutorial is how textual information appearing in the image to be described can be incorporated into the captioning process.
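To make the joint-embedding paradigm referenced above concrete, the sketch below projects image features and text features into a shared space and trains the projections with a symmetric contrastive loss. It is a generic illustration under assumed feature dimensions and temperature, not the implementation of any specific method from the tutorial.

```python
# Generic cross-modal joint embedding (illustrative sketch; feature dimensions,
# projection size and temperature are assumptions, not values from the tutorial).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, shared_dim=256):
        super().__init__()
        # Separate projections map each modality into one shared space.
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)

    def forward(self, img_feats, txt_feats):
        # L2-normalise so that cosine similarity reduces to a dot product.
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img, txt

def contrastive_loss(img, txt, temperature=0.07):
    # Matching image/text pairs sit on the diagonal of the similarity matrix;
    # the loss pulls each image toward its own text and away from the others.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage with pre-extracted features for a batch of 8 image-text pairs:
model = JointEmbedding()
img, txt = model(torch.randn(8, 2048), torch.randn(8, 768))
loss = contrastive_loss(img, txt)
```

Once such an embedding is trained, cross-modal matching reduces to nearest-neighbour search in the shared space, which is what makes the same machinery applicable to word spotting, lexicon generation and semantic image retrieval alike.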

Speaker: Dimosthenis Karatzas is an Associate Professor at the Universitat Autònoma de Barcelona and Associate Director of the Computer Vision Centre (CVC) in Barcelona, Spain. At the CVC he leads the Vision and Language research line, working at the intersection of computer vision and text analysis. He has co-authored more than 100 refereed journal and conference publications and has an h-index of 23. He received the 2013 IAPR/ICDAR Young Investigator Award and a 2017 Google Faculty Research Award. He has served in various roles at the key conferences of his field (ICDAR, DAS, CBDAR, ICPR, ICFHR), including co-chairing IWRR 2014/16/18 and CBDAR 2015/17. He is the lead organiser of the Robust Reading Competition series and chairs the Technical Committee on Reading Systems of the International Association for Pattern Recognition (IAPR). He was a founding and executive committee member of the SPIE UK chapter, is currently a member of the IAPR Education Committee, and is a member of the IEEE and the IAPR. He is one of the founders of the Library Living Lab, an open, participatory innovation space hosted in a public library.


Latest Content

Contemporary visual captioning models frequently hallucinate objects that are not actually in the scene, due to visual misclassification or over-reliance on priors, resulting in a semantic inconsistency between the visual information and the target lexical words. The most common remedy is to encourage the captioning model to dynamically link generated object words or phrases to appropriate regions of the image, i.e., grounded image captioning (GIC). However, GIC introduces an auxiliary task (grounding objects) that does not solve the key issue behind object hallucination, i.e., the semantic inconsistency. In this paper, we take a novel perspective on the issue: exploiting the semantic coherency between the visual and language modalities. Specifically, we propose the Consensus Graph Representation Learning framework (CGRL) for GIC, which incorporates a consensus representation into the grounded captioning pipeline. The consensus is learned by aligning the visual graph (e.g., a scene graph) to the language graph, considering both the nodes and the edges of the graphs. With the aligned consensus, the captioning model can capture both the correct linguistic characteristics and the visual relevance, and then ground the appropriate image regions. We validate the effectiveness of our model with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset. In addition, our CGRL is evaluated with several automatic metrics and by human evaluation; the results indicate that the proposed approach can simultaneously improve the performance of image captioning (+2.9 CIDEr) and grounding (+2.3 F1LOC).
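The CHAIRi figure quoted in the abstract is the instance-level object-hallucination rate: the fraction of object mentions in the generated captions that do not correspond to any object actually present in the image. The sketch below illustrates that computation in its simplest form; the object vocabulary, synonym handling and ground-truth annotations used by the actual benchmark are omitted, and all names are illustrative.

```python
# Illustrative computation of the instance-level CHAIR metric (CHAIRi):
# hallucinated object mentions divided by all object mentions.
# Inputs are assumed to be pre-tokenised captions and per-image ground-truth
# object sets; the real benchmark additionally maps synonyms to object classes.
def chair_i(captions, gt_objects, object_vocab):
    mentions, hallucinated = 0, 0
    for caption, gt in zip(captions, gt_objects):
        for word in caption:
            if word in object_vocab:      # the word names an object class ...
                mentions += 1
                if word not in gt:        # ... that is not present in the image
                    hallucinated += 1
    return hallucinated / max(mentions, 1)

# Toy usage: the caption mentions "dog" and "frisbee", but only "dog" is present.
print(chair_i(
    captions=[["a", "dog", "catches", "a", "frisbee"]],
    gt_objects=[{"dog"}],
    object_vocab={"dog", "frisbee", "cat"},
))  # -> 0.5
```

A lower CHAIRi therefore means fewer hallucinated object mentions, which is the sense in which the reported -9% is an improvement.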
