Recently, much progress has been made in image captioning, and the encoder-decoder framework has achieved outstanding performance on this task. In this paper, we propose an extension of the encoder-decoder framework by adding a component called the guiding network. The guiding network models the attribute properties of the input image, and its output is leveraged to compose the input of the decoder at each time step. The guiding network can be plugged into the current encoder-decoder framework and trained in an end-to-end manner. Hence, the guiding vector can be adaptively learned according to the signal from the decoder, enabling it to embed information from both the image and the language. Additionally, discriminative supervision can be employed to further improve the quality of the guidance. The advantages of our proposed approach are verified by experiments carried out on the MS COCO dataset.
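The described mechanism, in which a learned guiding vector is combined with the word embedding to form the decoder input at every time step, can be illustrated with a minimal PyTorch sketch. This is only a hypothetical implementation under simplifying assumptions (an LSTM decoder, a single-layer guiding network, and illustrative dimensions); the module names, sizes, and the concatenation scheme are our assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class GuidedDecoder(nn.Module):
    """Hypothetical sketch: a guiding network maps image features to a
    guiding vector g; g is concatenated with the word embedding to compose
    the decoder input at each time step, so g is trained end-to-end from
    the decoder's signal."""

    def __init__(self, feat_dim=2048, embed_dim=256, guide_dim=128,
                 hidden_dim=512, vocab_size=1000):
        super().__init__()
        self.hidden_dim = hidden_dim
        # Guiding network: assumed here to be one linear layer + tanh.
        self.guide = nn.Sequential(nn.Linear(feat_dim, guide_dim), nn.Tanh())
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + guide_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, tokens):
        # Guiding vector computed once per image, reused at every step.
        g = self.guide(img_feat)                               # (B, guide_dim)
        h = img_feat.new_zeros(img_feat.size(0), self.hidden_dim)
        c = h.clone()
        logits = []
        for t in range(tokens.size(1)):
            # Compose decoder input: word embedding concatenated with g.
            x = torch.cat([self.embed(tokens[:, t]), g], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                      # (B, T, vocab)
```

Because `g` enters the decoder at every step, gradients from the captioning loss flow back through the guiding network, which is what allows the guiding vector to adapt to both image and language information; an auxiliary attribute-classification loss on `g` would realize the discriminative supervision mentioned above.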