The conventional encoder-decoder framework for image captioning generally adopts a single-pass decoding process, which predicts the target descriptive sentence word by word in temporal order. Despite its great success, this framework suffers from two serious disadvantages. First, it cannot correct mistakes in already predicted words, which may mislead subsequent predictions and cause errors to accumulate. Second, it can only leverage the words generated so far, not the possible future words, and thus lacks the ability to plan globally over linguistic information. To overcome these limitations, we explore a universal two-pass decoding framework, in which a single-pass decoding model serving as the Drafting Model first generates a draft caption from an input image, and a Deliberation Model then polishes the draft into a better image description. Furthermore, inspired by the complementarity between different modalities, we propose a novel Cross Modification Attention (CMA) module to enhance the semantic expression of the image features and filter out erroneous information from the draft captions. We integrate CMA into the decoder of our Deliberation Model and name the result the Cross Modification Attention based Deliberation Model (CMA-DM). We train the proposed framework by jointly optimizing all trainable components from scratch with a trade-off coefficient. Experiments on the MS COCO dataset demonstrate that our approach obtains significant improvements over single-pass decoding baselines and achieves competitive performance compared with other state-of-the-art two-pass decoding methods.
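The control flow of the two-pass framework and the jointly weighted objective can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `two_pass_decode`, `joint_loss`, `toy_drafter`, and `toy_deliberator` are hypothetical, the toy models are plain functions rather than neural networks, and the additive form of the loss is only one plausible way to apply a trade-off coefficient, since the abstract does not specify it. The CMA module itself is not modeled.

```python
from typing import Callable, List, Sequence

Caption = List[str]


def two_pass_decode(
    image_features: Sequence[float],
    drafting_model: Callable[[Sequence[float]], Caption],
    deliberation_model: Callable[[Sequence[float], Caption], Caption],
) -> Caption:
    """Run a first pass to get a draft, then a second pass that refines it.

    The second pass receives the complete draft, so it can look at words
    both before and after any position, unlike single-pass decoding.
    """
    draft = drafting_model(image_features)
    return deliberation_model(image_features, draft)


def joint_loss(draft_loss: float, deliberation_loss: float, trade_off: float) -> float:
    """Combine the two losses with a trade-off coefficient.

    ASSUMPTION: a simple weighted sum; the abstract does not give the formula.
    """
    return deliberation_loss + trade_off * draft_loss


def toy_drafter(image_features: Sequence[float]) -> Caption:
    """Hypothetical stand-in for a Drafting Model; returns a flawed draft."""
    return ["a", "dog", "dog", "on", "grass"]


def toy_deliberator(image_features: Sequence[float], draft: Caption) -> Caption:
    """Hypothetical stand-in for a Deliberation Model.

    It only removes immediately repeated words, as a placeholder for the
    learned polishing step described in the abstract.
    """
    refined: Caption = []
    for word in draft:
        if not refined or refined[-1] != word:
            refined.append(word)
    return refined


if __name__ == "__main__":
    caption = two_pass_decode([0.1, 0.2], toy_drafter, toy_deliberator)
    print(" ".join(caption))
    print(joint_loss(draft_loss=2.0, deliberation_loss=1.0, trade_off=0.5))
```

In a real system both callables would be trained networks and the two loss terms would be computed per batch; the sketch only shows how the draft flows into the second pass and how a single coefficient could balance the two objectives.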