In recent years, transformer architectures have been widely applied to image captioning with impressive results. The geometric and positional relations among visual objects are generally regarded as crucial information for producing good captions. To further advance transformer-based image captioning, this paper proposes an improved Geometry Attention Transformer (GAT) model. To better exploit geometric information, two novel geometry-aware architectures are designed for the encoder and the decoder of GAT, respectively. Specifically, the model includes two working modules: 1) a geometry gate-controlled self-attention refiner, which explicitly incorporates relative spatial information into image region representations during encoding, and 2) a group of position-LSTMs, which precisely inform the decoder of relative word positions while generating caption text. Experimental comparisons on the MS COCO and Flickr30K datasets show that GAT is efficient and often outperforms current state-of-the-art image captioning models.
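The abstract does not give the refiner's exact equations. As a rough illustration only, below is a minimal PyTorch sketch of one plausible form a geometry gate-controlled self-attention layer could take: a scalar gate blends content-based attention logits with a bias computed from pairwise bounding-box geometry. The class name `GeometryGatedSelfAttention`, the per-region scalar gate, and the log-ratio box encoding (in the style of relation networks) are all illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometryGatedSelfAttention(nn.Module):
    """Sketch of geometry-gated self-attention over detected image regions.

    A gate in [0, 1] controls how strongly pairwise box geometry biases
    the content attention. Shapes and names are illustrative assumptions.
    """

    def __init__(self, d_model: int, d_geo: int = 64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Maps 4-d pairwise geometry features to a scalar attention bias.
        self.geo_bias = nn.Sequential(
            nn.Linear(4, d_geo), nn.ReLU(), nn.Linear(d_geo, 1)
        )
        # Per-region gate deciding how much geometric bias to apply.
        self.gate = nn.Linear(d_model, 1)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d_model) region features; boxes: (B, N, 4) as (cx, cy, w, h),
        # with w, h strictly positive.
        q, k, v = self.q(x), self.k(x), self.v(x)
        content = torch.matmul(q, k.transpose(-2, -1)) * self.scale  # (B, N, N)

        # Pairwise relative geometry in log scale (relation-network style).
        cx, cy, w, h = boxes.unbind(-1)
        dx = torch.log((cx.unsqueeze(-1) - cx.unsqueeze(-2)).abs() / w.unsqueeze(-1) + 1e-6)
        dy = torch.log((cy.unsqueeze(-1) - cy.unsqueeze(-2)).abs() / h.unsqueeze(-1) + 1e-6)
        dw = torch.log(w.unsqueeze(-1) / w.unsqueeze(-2))
        dh = torch.log(h.unsqueeze(-1) / h.unsqueeze(-2))
        rel = torch.stack([dx, dy, dw, dh], dim=-1)       # (B, N, N, 4)
        geo = self.geo_bias(rel).squeeze(-1)              # (B, N, N)

        # Gate the geometric bias per query region, then attend as usual.
        g = torch.sigmoid(self.gate(x))                   # (B, N, 1)
        attn = F.softmax(content + g * geo, dim=-1)
        return torch.matmul(attn, v)


if __name__ == "__main__":
    layer = GeometryGatedSelfAttention(d_model=512)
    feats = torch.randn(2, 36, 512)            # 36 detected regions per image
    boxes = torch.rand(2, 36, 4) + 0.1         # keep widths/heights positive
    print(layer(feats, boxes).shape)           # torch.Size([2, 36, 512])
```

In this sketch, letting the gate depend on each region's content means regions whose appearance is ambiguous can lean more on spatial relations, which is one way to read the abstract's "gate-controlled" refinement.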