It is widely believed that the higher the uncertainty of a word in a caption, the more inter-correlated contextual information is required to determine it. However, current image captioning methods usually generate all words of a sentence sequentially and treat them equally. In this paper, we propose an uncertainty-aware image captioning framework that iteratively inserts discontinuous candidate words between existing words in parallel, proceeding from easy to difficult until convergence. We hypothesize that high-uncertainty words in a sentence need more prior information to be decided correctly and should therefore be produced at a later stage. The resulting non-autoregressive hierarchy makes caption generation explainable and intuitive. Specifically, we utilize an image-conditioned bag-of-words model to measure word uncertainty and apply a dynamic programming algorithm to construct the training pairs. During inference, we devise an uncertainty-adaptive parallel beam search technique that yields an empirically logarithmic time complexity. Extensive experiments on the MS COCO benchmark show that our approach outperforms the strong baseline and related methods in both captioning quality and decoding speed.
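To make the uncertainty measure concrete, below is a minimal sketch (not the authors' code) of how per-word uncertainty could be estimated with an image-conditioned bag-of-words model: the model predicts a distribution over the vocabulary given only the image, and a word's negative log-probability under that distribution serves as its uncertainty score. The names `bow_logits` and `caption_ids` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def word_uncertainty(bow_logits: torch.Tensor, caption_ids: list[int]) -> list[float]:
    """Score each caption word by its negative log-probability under an
    image-conditioned bag-of-words distribution; words that are poorly
    grounded in the image get higher uncertainty and, in the scheme above,
    would be scheduled for later insertion stages."""
    log_probs = F.log_softmax(bow_logits, dim=-1)  # shape: (vocab_size,)
    return [-log_probs[w].item() for w in caption_ids]
```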
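The following is a minimal sketch, under assumed interfaces, of the iterative parallel insertion decoding described above: starting from an empty sequence, the model proposes a candidate word for every gap between existing words in parallel; confident (low-uncertainty) insertions are committed first, and the loop repeats until no gap proposes a new word. Since each round can roughly double the sequence length, the number of rounds grows logarithmically with caption length in the best case, matching the empirically logarithmic complexity claimed. `model.predict_slots` and `threshold` are hypothetical, not the paper's API.

```python
def insertion_decode(model, image_feats, max_rounds=20, threshold=0.5):
    tokens = []  # current partial caption
    for _ in range(max_rounds):
        # Hypothetical call: for each of the len(tokens)+1 gaps, return a
        # (word, confidence) proposal, computed in parallel by the model.
        proposals = model.predict_slots(image_feats, tokens)
        inserts = [(i, w) for i, (w, conf) in enumerate(proposals)
                   if w is not None and conf >= threshold]
        if not inserts:  # converged: no gap wants a new word
            break
        # Insert right-to-left so earlier gap indices stay valid.
        for i, w in sorted(inserts, reverse=True):
            tokens.insert(i, w)
    return tokens
```

Committing only high-confidence insertions each round is what realizes the easy-to-difficult schedule: low-uncertainty words enter early and provide context for the harder ones in later rounds.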