Aided by recent advances in Deep Learning, Image Caption Generation has seen tremendous progress over the last few years. Most methods use transfer learning to extract visual information, in the form of image features, with the help of pre-trained Convolutional Neural Network models, followed by transformation of the visual information using a Caption Generator module to generate the output sentences. Different methods have used different Convolutional Neural Network architectures and, to the best of our knowledge, there is no systematic study comparing the relative efficacy of different Convolutional Neural Network architectures for extracting visual information. In this work, we evaluate 17 different Convolutional Neural Networks on two popular Image Caption Generation frameworks: the first based on the Neural Image Caption (NIC) generation model and the second based on the Soft-Attention framework. We observe that the complexity of a Convolutional Neural Network, as measured by its number of parameters, and its accuracy on the Object Recognition task do not necessarily correlate with its efficacy at feature extraction for the Image Caption Generation task.