Image Captioning based on Feature Refinement and Reflective Decoding

Image captioning is the task of automatically generating a natural-language description of an image. It is an active research topic at the intersection of two major fields in artificial intelligence: computer vision and natural language processing. Image captioning is one of the significant challenges in image understanding, since it requires recognizing not only the salient objects in an image but also their attributes and the way they interact. The system must then generate a syntactically and semantically correct caption that describes the image content. Given the significant progress in deep learning models and their ability to encode large sets of images and generate correct sentences, several neural captioning approaches have been proposed recently, each aiming for better accuracy and caption quality. This paper introduces an encoder-decoder-based image captioning system in which the encoder extracts spatial and global features for each region in the image using Faster R-CNN with a ResNet-101 backbone. This stage is followed by a refining model, which uses an attention-on-attention mechanism to extract the visual features of the target image objects and then determine their interactions. The decoder consists of an attention-based recurrent module and a reflective attention module, which collaboratively apply attention to the visual and textual features to enhance the decoder's ability to model long-term sequential dependencies. Extensive experiments performed on two benchmark datasets, MSCOCO and Flickr30K, show the effectiveness of the proposed approach and the high quality of the generated captions.
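The abstract names an attention-on-attention mechanism for the refining stage but does not spell out its details. As a rough illustration only, the general idea can be sketched as follows: an ordinary attention result is combined with its query to form an information vector and a sigmoid gate, and the gate filters the information elementwise. The sketch assumes single-head scaled dot-product self-attention over region features; the function names, the 36 x 64 feature shape, and the random weights are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d, rng):
    # Hypothetical random weights; a real model would learn these.
    scale = 1.0 / np.sqrt(2 * d)
    return {
        "W_info": rng.standard_normal((2 * d, d)) * scale,
        "b_info": np.zeros(d),
        "W_gate": rng.standard_normal((2 * d, d)) * scale,
        "b_gate": np.zeros(d),
    }


def attention_on_attention(Q, K, V, params):
    """Gate an attention result with a sigmoid conditioned on the query.

    Q: (n_q, d) queries; K, V: (n_k, d) keys and values.
    Returns an (n_q, d) array of refined features.
    """
    d = Q.shape[-1]
    # Step 1: plain scaled dot-product attention.
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (n_q, n_k)
    attended = weights @ V                            # (n_q, d)
    # Step 2: information vector and gate, both computed from the
    # concatenation [query, attended result].
    qa = np.concatenate([Q, attended], axis=-1)       # (n_q, 2d)
    info = qa @ params["W_info"] + params["b_info"]
    gate = sigmoid(qa @ params["W_gate"] + params["b_gate"])
    # Step 3: elementwise gating keeps only the information the gate admits.
    return gate * info


# Illustrative input: 36 region features of dimension 64 (assumed sizes).
rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 64))
params = init_params(64, rng)

# Self-attention over regions: each region attends to all regions.
refined = attention_on_attention(regions, regions, regions, params)
print(refined.shape)  # (36, 64)
```

Using the region features as queries, keys, and values at once lets every region attend to every other region, which is one simple way to model object interactions. The gate then suppresses attended information that is irrelevant to the query region.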
