Image captioning is the process of automatically generating a natural-language description of an image. It is one of the significant challenges in image understanding, since it requires recognizing not only the salient objects in the image but also their attributes and the way they interact. The system must then generate a syntactically and semantically correct caption that describes the image content in natural language. With the significant progress in deep learning models and their ability to effectively encode large sets of images and generate correct sentences, several neural-based captioning approaches have been proposed recently, each trying to achieve better accuracy and caption quality. This paper introduces an encoder-decoder-based image captioning system in which the encoder extracts spatial features from the image using ResNet-101. This stage is followed by a refining model, which uses an attention-on-attention mechanism to extract the visual features of the target image objects and then model their interactions. The decoder consists of an attention-based recurrent module and a reflective attention module, which collaboratively apply attention to the visual and textual features to enhance the decoder's ability to model long-term sequential dependencies. Extensive experiments performed on Flickr30K show the effectiveness of the proposed approach and the high quality of the generated captions.
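As a rough illustration of the attention-on-attention idea mentioned above, the sketch below applies standard scaled dot-product attention and then gates the attended result with a learned sigmoid gate computed from the result and the query. This is a minimal NumPy sketch, not the paper's implementation; the weight shapes and the concatenation layout are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_on_attention(Q, K, V, W_i, W_g, b_i, b_g):
    """Scaled dot-product attention followed by an AoA-style gate:
    an information vector i and a sigmoid gate g are both computed
    from the attended result concatenated with the query, and the
    output is their elementwise product g * i."""
    d = Q.shape[-1]
    att = softmax(Q @ K.T / np.sqrt(d)) @ V      # attended result
    cat = np.concatenate([att, Q], axis=-1)      # [attended ; query]
    i = cat @ W_i + b_i                          # information vector
    g = sigmoid(cat @ W_g + b_g)                 # attention gate in (0, 1)
    return g * i

# Toy dimensions (hypothetical): 3 decoder queries over 5 spatial features.
d = 8
Q = rng.standard_normal((3, d))
K = rng.standard_normal((5, d))
V = rng.standard_normal((5, d))
W_i = rng.standard_normal((2 * d, d)) * 0.1
W_g = rng.standard_normal((2 * d, d)) * 0.1
out = attention_on_attention(Q, K, V, W_i, W_g, np.zeros(d), np.zeros(d))
print(out.shape)  # (3, 8)
```

The gate lets the model suppress attention results that are irrelevant to the current query, which is the motivation for refining the object features before modeling their interactions.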