In this paper, we propose an end-to-end CNN-LSTM model with a local-object attention mechanism for generating descriptions of sequential images. To generate coherent descriptions, we capture the global semantic context with a multi-layer perceptron that learns the dependencies between sequential images. A parallel LSTM network is exploited to decode the sequence of descriptions. Experimental results show that our model outperforms the baseline on three different evaluation metrics on the datasets published by Microsoft.
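To make the described architecture concrete, the following is a minimal PyTorch sketch of how such a model could be wired together. It is not the authors' implementation: the module names (`LocalObjectAttention`, `SequentialCaptioner`), the additive soft-attention form, the MLP fusion scheme for the global context, and all dimensions and defaults (e.g. five images per story) are illustrative assumptions.

```python
# A minimal sketch of a CNN-LSTM sequential-image captioner with
# local-object attention, a global-context MLP, and parallel LSTM
# decoders. All design details here are assumptions for illustration.
import torch
import torch.nn as nn

class LocalObjectAttention(nn.Module):
    """Soft attention over per-image local object features (assumed form)."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, obj_feats, h):
        # obj_feats: (batch, num_objects, feat_dim); h: (batch, hidden_dim)
        h_exp = h.unsqueeze(1).expand(-1, obj_feats.size(1), -1)
        alpha = torch.softmax(self.score(torch.cat([obj_feats, h_exp], -1)), dim=1)
        return (alpha * obj_feats).sum(dim=1)  # attended object context

class SequentialCaptioner(nn.Module):
    def __init__(self, num_images=5, feat_dim=2048, hidden_dim=512, vocab=10000):
        super().__init__()
        # MLP that fuses the CNN features of all images into one global
        # semantic context shared by every decoder (assumed fusion scheme).
        self.global_mlp = nn.Sequential(
            nn.Linear(num_images * feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        self.embed = nn.Embedding(vocab, hidden_dim)
        self.attend = LocalObjectAttention(feat_dim, hidden_dim)
        # One LSTM decoder per image, run in parallel over the sequence.
        self.decoders = nn.ModuleList(
            [nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)
             for _ in range(num_images)])
        self.out = nn.Linear(hidden_dim, vocab)

    def forward(self, img_feats, obj_feats, tokens):
        # img_feats: (batch, num_images, feat_dim) pooled CNN features
        # obj_feats: (batch, num_images, num_objects, feat_dim)
        # tokens:    (batch, num_images, seq_len) ground-truth word ids
        g = self.global_mlp(img_feats.flatten(1))  # global semantic context
        logits = []
        for i, dec in enumerate(self.decoders):
            h, c = g, torch.zeros_like(g)          # init from global context
            step_logits = []
            for t in range(tokens.size(2)):
                ctx = self.attend(obj_feats[:, i], h)  # local-object attention
                x = torch.cat([self.embed(tokens[:, i, t]), ctx], dim=-1)
                h, c = dec(x, (h, c))
                step_logits.append(self.out(h))
            logits.append(torch.stack(step_logits, dim=1))
        return torch.stack(logits, dim=1)  # (batch, num_images, seq_len, vocab)
```

Under these assumptions, each per-image decoder is initialized from the shared global context produced by the MLP, so the generated sentences stay coherent across the story while the attention module grounds each word in that image's local object features.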