Deep neural networks (DNNs) have recently become popular for image captioning problems in remote sensing (RS). Existing DNN-based approaches rely on the availability of a training set made up of a large number of RS images with their captions. However, the captions of training images may contain redundant information (they can be repetitive or semantically similar to each other), which results in information deficiency when learning a mapping from the image domain to the language domain. To overcome this limitation, in this paper we present a novel Summarization Driven Remote Sensing Image Captioning (SD-RSIC) approach. The proposed approach consists of three main steps. The first step obtains the standard image captions by jointly exploiting convolutional neural networks (CNNs) and long short-term memory (LSTM) networks. The second step, unlike existing RS image captioning methods, summarizes the ground-truth captions of each training image into a single caption by exploiting sequence-to-sequence neural networks, thereby eliminating the redundancy present in the training set. The third step automatically defines adaptive weights associated with each RS image to combine the standard captions with the summarized captions based on the semantic content of the image. This is achieved by a novel adaptive weighting strategy defined in the context of LSTM networks. Experimental results obtained on the RSICD, UCM-Captions and Sydney-Captions datasets show the effectiveness of the proposed approach compared to state-of-the-art RS image captioning approaches. The code of the proposed approach is publicly available at https://gitlab.tubit.tu-berlin.de/rsim/SD-RSIC.
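The third step can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the code below assumes a simple convex combination of two per-word probability distributions, one from the standard captioning branch and one from the summarization branch, gated by a scalar weight in [0, 1]. The function names, the toy vocabulary, and the sigmoid gate are hypothetical illustrations, not the authors' implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_weight(gate_logit):
    """Assumed gate: squash an image-dependent scalar into (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-gate_logit))


def combine_word_distributions(p_standard, p_summary, weight):
    """Convex combination of two vocabulary distributions (assumed form)."""
    if len(p_standard) != len(p_summary):
        raise ValueError("distributions must cover the same vocabulary")
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(p_standard, p_summary)]


# Toy vocabulary and scores, made up for illustration only.
vocab = ["airport", "runway", "trees", "river"]
p_standard = softmax([2.0, 1.0, 0.5, 0.1])  # standard captioning branch
p_summary = softmax([0.5, 2.5, 0.2, 0.1])   # summarized-caption branch

# In a real model the gate logit would be predicted from image features;
# here it is fixed at 0.0, which weights both branches equally.
w = adaptive_weight(0.0)
p_final = combine_word_distributions(p_standard, p_summary, w)
best_word = vocab[max(range(len(vocab)), key=p_final.__getitem__)]
```

Because the weight is a convex coefficient, the combined vector remains a valid probability distribution. A gate near 1 would favor the standard caption branch and a gate near 0 the summarized one.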