Hierarchical LSTMs with Adaptive Attention for Visual Captioning

Attention-based encoder-decoder frameworks have recently driven progress in image and video captioning. Most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the" and "a"). However, non-visual words can be predicted easily by a natural language model without considering visual signals or attention, and imposing the attention mechanism on them can mislead the decoder and decrease overall captioning performance. Furthermore, a hierarchy of LSTMs enables more complex representations of visual data, capturing information at different scales. To address these issues, we propose a hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning. Specifically, the proposed framework uses spatial or temporal attention to select specific regions or frames for predicting the related words, while the adaptive attention decides whether to rely on the visual information or the language context information. In addition, a hierarchical LSTM is designed to simultaneously consider both low-level visual information and high-level language context information to support caption generation. We initially designed hLSTMat for the video captioning task, then refined it and applied it to the image captioning task. To demonstrate the effectiveness of the proposed framework, we evaluate it on both video and image captioning tasks. Experimental results show that our approach achieves state-of-the-art performance on most evaluation metrics for both tasks. The effect of the important components is also explored in an ablation study.
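The abstract gives no equations, so the following is only a minimal NumPy sketch of one way the two attention stages could fit together: soft attention over region or frame features, followed by a scalar gate that mixes the attended visual context with a language context vector. The function name `adaptive_attention_step`, the parameter names (`W_f`, `W_h`, `w_a`, `w_g`), the additive scoring form, and the sigmoid gate are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def adaptive_attention_step(features, hidden, language_context,
                            W_f, W_h, w_a, w_g):
    """One hypothetical decoding step (illustrative, not the paper's equations).

    features:          (n, d) region features (image) or frame features (video)
    hidden:            (k,)   decoder LSTM hidden state
    language_context:  (d,)   language context vector, assumed already
                              projected to the feature dimension
    W_f (d, m), W_h (k, m), w_a (m,): attention parameters (assumed names)
    w_g (k,):          gate parameters (assumed name)
    """
    # Spatial/temporal attention: score every region or frame
    # against the current decoder state (additive scoring).
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w_a   # (n,)
    alpha = softmax(scores)                                  # weights sum to 1
    visual_context = alpha @ features                        # (d,)

    # Adaptive gate: beta near 1 relies on visual information,
    # beta near 0 relies on the language context.
    beta = 1.0 / (1.0 + np.exp(-(w_g @ hidden)))
    context = beta * visual_context + (1.0 - beta) * language_context
    return context, alpha, beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, k, m = 5, 8, 6, 4  # arbitrary toy sizes
    context, alpha, beta = adaptive_attention_step(
        rng.normal(size=(n, d)), rng.normal(size=k), rng.normal(size=d),
        rng.normal(size=(d, m)), rng.normal(size=(k, m)),
        rng.normal(size=m), rng.normal(size=k),
    )
    print(context.shape, float(alpha.sum()), float(beta))
```

In this sketch, a non-visual word such as "the" would correspond to a small `beta`, so the output context is dominated by the language context rather than the attended visual features.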
