We study the problem of weakly supervised grounded image captioning. That is, given an image, the goal is to automatically generate a sentence describing the context of the image, with each noun grounded to its corresponding region in the image. This task is challenging because explicit fine-grained region-word alignments are not available as supervision. Previous weakly supervised methods mainly explore various regularization schemes to improve attention accuracy. However, their performance is still far from that of fully supervised methods. One main issue that has been overlooked is that the attention used to generate visually groundable words may focus only on the most discriminative parts of an object and fail to cover the whole object. We refer to this issue as the partial grounding problem and propose a simple yet effective method to alleviate it. Specifically, we design a distributed attention mechanism that forces the network to aggregate information from multiple spatially distinct regions with consistent semantics while generating the words. As a result, the union of the attended region proposals should form a visual region that completely encloses the object of interest. Extensive experiments demonstrate the superiority of the proposed method over state-of-the-art approaches.
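To make the idea of distributed attention concrete, the following is a minimal sketch and not the method as implemented in the paper. The abstract does not specify the architecture, so several things here are assumptions made purely for illustration: that attention is split into K branches, each with its own scores over the region proposals; that the branches are averaged to aggregate information; and that a word is grounded by taking the smallest box enclosing the top-scoring proposal of every branch. All function names (`distributed_attention`, `union_box`, `ground_word`) and the toy numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distributed_attention(branch_scores):
    """Normalize each branch separately, then average across branches.

    branch_scores: K lists, each holding one raw score per region proposal.
    Returns (per-branch attention weights, averaged attention weights).
    Averaging is an assumed aggregation rule, chosen for simplicity.
    """
    per_branch = [softmax(s) for s in branch_scores]
    k = len(per_branch)
    n = len(per_branch[0])
    aggregated = [sum(b[i] for b in per_branch) / k for i in range(n)]
    return per_branch, aggregated


def union_box(boxes):
    """Smallest axis-aligned box (x1, y1, x2, y2) enclosing all given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def ground_word(branch_scores, boxes):
    """Pick the top proposal of each branch and enclose the picked boxes.

    Assumption: the enclosing box stands in for the "union" of the
    attended region proposals mentioned in the abstract.
    """
    per_branch, _ = distributed_attention(branch_scores)
    picked = sorted({max(range(len(w)), key=w.__getitem__) for w in per_branch})
    return picked, union_box([boxes[i] for i in picked])


# Toy data: three region proposals, two attention branches.
boxes = [(10, 10, 50, 60), (40, 20, 90, 80), (200, 200, 220, 220)]
branch_scores = [
    [2.0, 0.5, -1.0],  # branch 0 prefers proposal 0 (e.g. one part of the object)
    [0.3, 1.8, -0.5],  # branch 1 prefers proposal 1 (e.g. another part)
]

picked, region = ground_word(branch_scores, boxes)
print(picked, region)  # [0, 1] (10, 10, 90, 80)
```

In this toy case a single attention distribution that peaks on proposal 0 would ground the word to only part of the object, whereas combining the two branches yields a box covering both parts. How the branches are encouraged to attend to different regions with consistent semantics is not stated in the abstract and is left out of the sketch.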