Automated image captioning is a deep learning application that fuses work from computer vision and natural language processing, and it is typically performed using encoder-decoder architectures. In this project, we have implemented and experimented with several variants of multi-modal image captioning networks, exploring ResNet101-, DenseNet121-, and VGG19-based CNN encoders paired with attention-based LSTM decoders. We have studied the effect of beam size and of using pretrained word embeddings, and compared these models against a baseline CNN encoder and RNN decoder architecture. The goal is to analyze the performance of each approach using several evaluation metrics, including BLEU, CIDEr, ROUGE, and METEOR. We have also explored model explainability using Visual Attention Maps (VAM) to highlight the parts of the image that contribute most to predicting each word of the generated caption.
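For orientation, the following is a minimal sketch of the kind of architecture described above: a pretrained CNN (here ResNet101) used as a spatial feature extractor, feeding an attention-based LSTM decoder. It assumes PyTorch and a recent torchvision; class names, layer dimensions, and the additive-attention formulation are illustrative assumptions, not the exact implementation used in this project.

```python
# Illustrative sketch, not the project's actual code.
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """CNN encoder: pretrained ResNet101 with the classifier head removed,
    returning a grid of spatial features for the attention decoder."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # keep conv features only

    def forward(self, images):                        # images: (B, 3, H, W)
        feats = self.backbone(images)                 # (B, 2048, h, w)
        return feats.flatten(2).permute(0, 2, 1)      # (B, h*w, 2048) regions


class AttentionDecoder(nn.Module):
    """LSTM decoder with additive attention over image regions; the attention
    weights (alphas) can be reshaped to the feature grid to visualize which
    image regions drive each predicted word."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 feat_dim=2048, attention_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, attention_dim)
        self.att_hid = nn.Linear(hidden_dim, attention_dim)
        self.att_score = nn.Linear(attention_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):               # feats: (B, R, feat_dim)
        B, T = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        emb = self.embed(captions)                    # (B, T, embed_dim)
        logits, alphas = [], []
        for t in range(T):
            # additive attention: score each of the R regions against the hidden state
            score = self.att_score(torch.tanh(
                self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))  # (B, R, 1)
            alpha = torch.softmax(score, dim=1)
            context = (alpha * feats).sum(dim=1)      # (B, feat_dim) attended context
            h, c = self.lstm(torch.cat([emb[:, t], context], dim=1), (h, c))
            logits.append(self.fc(h))
            alphas.append(alpha.squeeze(-1))          # kept for visual attention maps
        return torch.stack(logits, dim=1), torch.stack(alphas, dim=1)
```

Swapping DenseNet121 or VGG19 into the encoder amounts to changing the backbone and `feat_dim`; beam search and pretrained word embeddings would replace the greedy decoding loop and the randomly initialized `nn.Embedding`, respectively.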