In this paper, we study local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions. To achieve this goal, we propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF). LSA is deployed for intra-layer interaction in the Transformer by modeling the relationship between each grid and its neighbors, which reduces the difficulty of recognizing local objects during captioning. LSF is used for inter-layer information fusion, aggregating the information of different encoder layers for cross-layer semantic complementarity. With these two novel designs, the proposed LSTNet can model the local visual information of grid features to improve captioning quality. To validate LSTNet, we conduct extensive experiments on the competitive MS-COCO benchmark. The experimental results show that LSTNet is not only capable of local visual modeling, but also outperforms a range of state-of-the-art captioning models on both offline and online testing, achieving 134.8 CIDEr and 136.3 CIDEr, respectively. Moreover, the generalization ability of LSTNet is verified on the Flickr8k and Flickr30k datasets.
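To make the locality constraint concrete, below is a minimal PyTorch sketch of self-attention restricted to a local neighborhood on a feature grid, in the spirit of LSA. The function name, the `window` parameter, and the Chebyshev-distance mask are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def locality_sensitive_attention(grid_feats, window=3):
    """Self-attention over grid features, restricted to a local neighborhood.

    grid_feats: (B, H, W, D) grid features from a visual backbone.
    window:     odd neighborhood size (hypothetical parameter).
    """
    B, H, W, D = grid_feats.shape
    x = grid_feats.view(B, H * W, D)

    # Build a mask that keeps, for each grid, only the positions within its local window.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)        # (HW, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)   # (HW, HW) Chebyshev distance
    local_mask = dist <= window // 2                                  # True inside the neighborhood

    # Scaled dot-product attention with the locality mask applied to the logits.
    scores = torch.matmul(x, x.transpose(1, 2)) / D ** 0.5            # (B, HW, HW)
    scores = scores.masked_fill(~local_mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, x).view(B, H, W, D)
```

In this sketch, each grid only attends to grids within its window, which captures the neighbor-modeling idea described above; the full LSTNet design integrates this locality bias into the Transformer encoder together with the cross-layer fusion of LSF.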