Descriptive region features extracted by object detection networks have played an important role in the recent advancements of image captioning. However, they are still criticized for the lack of contextual information and fine-grained details, which in contrast are the merits of traditional grid features. In this paper, we introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features. Concretely, in DLCT, these two features are first processed by a novel Dual-Way Self-Attention (DWSA) to mine their intrinsic properties, where a Comprehensive Relation Attention component is also introduced to embed the geometric information. In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noise caused by the direct fusion of these two features, where a geometric alignment graph is constructed to accurately align and reinforce region and grid features. To validate our model, we conduct extensive experiments on the highly competitive MS-COCO dataset, and achieve new state-of-the-art performance on both local and online test sets, i.e., 133.8% CIDEr-D on the Karpathy split and 135.4% CIDEr on the official split. Code is available at https://github.com/luo3300612/image-captioning-DLCT.
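The abstract's core idea, restricting cross attention between region and grid features to geometrically aligned pairs, can be illustrated with a minimal sketch. This is not the paper's implementation (see the linked repository for that); it assumes a precomputed boolean alignment mask marking which grid cells overlap each region's bounding box, and shows how such a mask gates a standard scaled dot-product cross attention.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def locality_constrained_cross_attention(regions, grids, align_mask):
    """Cross attention from region queries to grid keys/values,
    restricted by a geometric alignment mask.

    regions:    (n_regions, d) region features acting as queries
    grids:      (n_grids, d)   grid features acting as keys/values
    align_mask: (n_regions, n_grids) bool; True where a grid cell
                overlaps the region (a hypothetical construction of
                the paper's geometric alignment graph)
    """
    d = regions.shape[-1]
    scores = regions @ grids.T / np.sqrt(d)          # (n_regions, n_grids)
    scores = np.where(align_mask, scores, -1e9)      # suppress non-aligned cells
    attn = softmax(scores, axis=-1)
    return attn @ grids                              # aligned grid context per region
```

In this sketch, masked grid cells receive near-zero attention weight, so each region feature is reinforced only by the grid cells its bounding box overlaps, which is the stated purpose of the geometric alignment graph.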