Image captioning with the encoder-decoder framework has advanced tremendously over the last decade, with a CNN typically serving as the encoder and an LSTM as the decoder. Despite impressive accuracy on simple images, this pairing is inefficient in both time and space complexity. Moreover, on complex images containing many objects and rich information, the performance of the CNN-LSTM pair degrades sharply because it lacks a semantic understanding of the scenes presented in the images. To address these issues, we present a CNN-GRU encoder-decoder framework with a caption-to-image reconstructor that accounts for both semantic context and time complexity. Using the decoder's hidden states, the input image and its related semantic representations are reconstructed, and the reconstruction scores from a semantic reconstructor are combined with the likelihood during model training to assess the quality of the generated caption. As a result, the decoder receives richer semantic information, enhancing the caption-generation process. During model testing, the reconstruction score can likewise be combined with the log-likelihood to select the most appropriate caption. The proposed model outperforms the state-of-the-art LSTM-A5 image-captioning model in terms of both time complexity and accuracy.
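The test-time caption-selection scheme described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the weight `lam` and all function and variable names are assumptions introduced here to show how a log-likelihood and a semantic reconstruction score might be blended to rank candidate captions.

```python
# Hypothetical sketch of the combined selection score: the decoder's
# log-likelihood for a candidate caption is blended with a semantic
# reconstruction score (how well the caption's hidden states reconstruct
# the input image's semantics). The weight `lam` is an assumed hyperparameter.

def combined_score(log_likelihood, reconstruction_score, lam=0.5):
    # Higher is better: caption likelihood plus weighted reconstruction term.
    return log_likelihood + lam * reconstruction_score

def select_caption(candidates, lam=0.5):
    # candidates: list of (caption, log_likelihood, reconstruction_score)
    return max(candidates, key=lambda c: combined_score(c[1], c[2], lam))

# Illustrative candidate captions with made-up scores.
candidates = [
    ("a dog on grass", -2.1, 0.80),
    ("a dog runs across a grassy field", -2.6, 0.95),
]
best = select_caption(candidates)
```

The same weighted combination can serve as an auxiliary term in the training objective, encouraging the decoder's hidden states to retain the semantic information needed to reconstruct the image.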