Convolutional Neural Networks (CNNs) have recently been proposed for sequence modelling tasks such as image caption generation. However, unlike Recurrent Neural Networks (RNNs), the performance of CNNs as decoders for image caption generation has not been extensively studied. In this work, we analyse how various aspects of CNN-based decoders, such as network complexity and depth, the use of data augmentation, the attention mechanism, and the length of sentences used during training, affect model performance. We perform experiments on the Flickr8k and Flickr30k image captioning datasets and observe that, unlike an RNN-based decoder, a convolutional decoder for image captioning generally does not benefit from increased network depth (in the form of stacked convolutional layers) or from data augmentation techniques. Moreover, the attention mechanism provides only limited performance gains with a convolutional decoder. Furthermore, convolutional decoders achieve performance comparable to recurrent decoders only when trained on short sentences containing up to 15 words; they show limitations when trained on longer sentences, which suggests that convolutional decoders may not be able to model long-term dependencies efficiently. In addition, the convolutional decoder usually performs poorly on the CIDEr evaluation metric compared to the recurrent decoder.