Image captioning is an interdisciplinary research problem that stands between computer vision and natural language processing. The task is to generate a textual description of the content of an image. The typical model used for image captioning is an encoder-decoder deep network, where the encoder captures the essence of an image while the decoder is responsible for generating a sentence describing the image. Attention mechanisms can be used to automatically focus the decoder on the parts of the image that are relevant for predicting the next word. In this paper, we explore different decoders and attentional models popular in neural machine translation, namely attentional recurrent neural networks, self-attentional transformers, and fully convolutional networks, which represent the current state of the art in neural machine translation. The image captioning module is available as part of SOCKEYE at https://github.com/awslabs/sockeye ; a tutorial can be found at https://awslabs.github.io/sockeye/image_captioning.html .
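To make the attention mechanism described above concrete, the following is a minimal sketch, not the SOCKEYE implementation: it assumes dot-product attention, random placeholder features, and hypothetical shapes (a 7x7 CNN feature map). It shows how the decoder scores each image region against its hidden state, normalizes the scores, and forms a context vector that guides prediction of the next word.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical setup: the encoder (e.g. a CNN) yields one feature
# vector per spatial region of the image; the decoder keeps a hidden
# state summarizing the words generated so far.
num_regions, feat_dim = 49, 512                    # e.g. a 7x7 feature map
regions = np.random.randn(num_regions, feat_dim)   # placeholder encoder output
hidden = np.random.randn(feat_dim)                 # placeholder decoder state

# Dot-product attention: score each region against the decoder state,
# normalize with softmax, and take the weighted sum of region features
# as the context vector used to predict the next word.
scores = regions @ hidden      # (num_regions,) relevance of each region
weights = softmax(scores)      # attention distribution over regions
context = weights @ regions    # (feat_dim,) attended image summary
```

The attentional variants explored in the paper differ mainly in how the decoder state is computed (recurrent, self-attentional, or convolutional), while the idea of weighting image regions by relevance is common to all of them.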