It is widely believed that parsing an image into its constituent visual patterns is helpful for understanding and representing an image. Nevertheless, there has been little evidence in support of this idea when it comes to describing an image with a natural-language utterance. In this paper, we introduce a new design that models a hierarchy from instance level (segmentation) and region level (detection) to the whole image, in pursuit of a thorough image understanding for captioning. Specifically, we present a HIerarchy Parsing (HIP) architecture that integrates hierarchical structure into the image encoder in a novel way. Technically, an image is decomposed into a set of regions, and some of the regions are further resolved into finer ones. Each region then regresses to an instance, i.e., the foreground of the region. This process naturally builds a hierarchical tree. A tree-structured Long Short-Term Memory (Tree-LSTM) network is then employed to interpret the hierarchical structure and enhance the instance-level, region-level, and image-level features. HIP is appealing in that it can be plugged into any neural captioning model. Extensive experiments on the COCO image captioning dataset demonstrate the superiority of HIP. More remarkably, HIP plus a top-down attention-based LSTM decoder increases CIDEr-D from 120.1% to 127.2% on the COCO Karpathy test split. When the instance-level and region-level features from HIP are further endowed with semantic relations learnt through Graph Convolutional Networks (GCN), CIDEr-D is boosted to 130.6%.
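The abstract only sketches how the Tree-LSTM propagates information over the instance-region-image hierarchy. As a rough illustration, the following is a minimal sketch of a child-sum Tree-LSTM cell (Tai et al., 2015) applied bottom-up over such a tree; the class names, dimensions, and tree dictionary format are assumptions for exposition and do not reflect the authors' implementation.

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    """Child-sum Tree-LSTM cell: aggregates the hidden states of a node's
    children (e.g., instances under a region, regions under the image) and
    updates that node's feature representation."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.W_iou = nn.Linear(in_dim, 3 * hid_dim)          # input/output/update gates from node feature
        self.U_iou = nn.Linear(hid_dim, 3 * hid_dim, bias=False)
        self.W_f = nn.Linear(in_dim, hid_dim)                 # forget gate, one per child
        self.U_f = nn.Linear(hid_dim, hid_dim, bias=False)
        self.hid_dim = hid_dim

    def forward(self, x, child_h, child_c):
        # x: (in_dim,) raw feature of the current node
        # child_h, child_c: (num_children, hid_dim) states of its children
        h_tilde = child_h.sum(dim=0)                          # child-sum aggregation
        i, o, u = (self.W_iou(x) + self.U_iou(h_tilde)).chunk(3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x) + self.U_f(child_h))    # per-child forget gates
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c

def encode_tree(cell, node, features):
    """Recursively encode the hierarchy bottom-up.
    `node` is a hypothetical dict: {'id': int, 'children': [node, ...]};
    `features[id]` is the CNN feature of that instance/region/image node."""
    if node['children']:
        states = [encode_tree(cell, ch, features) for ch in node['children']]
        child_h = torch.stack([h for h, _ in states])
        child_c = torch.stack([c for _, c in states])
    else:
        # leaf (instance-level) node: use zero child states
        child_h = torch.zeros(1, cell.hid_dim)
        child_c = torch.zeros(1, cell.hid_dim)
    return cell(features[node['id']], child_h, child_c)
```

In this sketch, the enhanced hidden state `h` of each node would serve as its hierarchy-aware feature, so the same pass yields refined instance-level, region-level, and image-level representations that a captioning decoder could attend over.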