Image captioning is a challenging task that combines the fields of computer vision and natural language processing. A variety of approaches have been proposed to automatically describe an image, and models based on recurrent neural networks (RNNs) or long short-term memory (LSTM) dominate this field. However, RNNs and LSTMs cannot be computed in parallel and ignore the underlying hierarchical structure of a sentence. In this paper, we propose a framework that employs only convolutional neural networks (CNNs) to generate captions. Owing to parallel computing, our basic model is around 3 times faster than NIC (an LSTM-based model) during training, while also providing better results. We conduct extensive experiments on MSCOCO and investigate the influence of model width and depth. Compared with LSTM-based models that apply similar attention mechanisms, our proposed model achieves comparable BLEU-1,2,3,4 and METEOR scores and higher CIDEr scores. We also test our model on the paragraph annotation dataset, where it obtains a higher CIDEr score than hierarchical LSTMs.
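To make the parallelism argument concrete, the sketch below shows a minimal causal (left-padded) convolutional caption decoder in PyTorch. It is not the paper's architecture; the class names, layer counts, and dimensions are all illustrative assumptions. The point it demonstrates is that a stack of masked 1-D convolutions computes the logits for every caption position in a single forward pass, whereas an RNN/LSTM must step through the sequence token by token.

```python
# Minimal sketch (assumed names and hyper-parameters, not the paper's model) of
# a causal convolutional caption decoder: every output position is computed in
# one parallel forward pass, unlike the sequential steps of an RNN/LSTM.
import torch
import torch.nn as nn


class CausalConvBlock(nn.Module):
    """One masked-convolution layer: output at position t sees only tokens <= t."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size - 1                       # left padding preserves causality
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))          # pad the left side only
        return torch.relu(self.conv(x))


class ConvCaptionDecoder(nn.Module):
    """Toy CNN caption decoder conditioned on a global image feature."""

    def __init__(self, vocab_size: int, embed_dim: int = 256,
                 image_dim: int = 2048, num_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(image_dim, embed_dim)  # fuse the image into every step
        self.blocks = nn.ModuleList(
            CausalConvBlock(embed_dim) for _ in range(num_layers))
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, time) word ids; image_feat: (batch, image_dim)
        h = self.embed(tokens) + self.img_proj(image_feat).unsqueeze(1)
        h = h.transpose(1, 2)                            # -> (batch, channels, time)
        for block in self.blocks:
            h = block(h)                                 # all positions computed at once
        return self.out(h.transpose(1, 2))               # (batch, time, vocab) logits


# Training-time usage: logits for all 16 caption positions in one parallel pass.
decoder = ConvCaptionDecoder(vocab_size=10000)
logits = decoder(torch.randint(0, 10000, (4, 16)), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 16, 10000])
```

The left-only padding is the key design choice: it keeps each output position from seeing future tokens, so teacher-forced training can score the whole caption at once while greedy decoding at test time still proceeds left to right.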