Recent advances in text-to-image generation have witnessed the rise of diffusion models as powerful generative models. Nevertheless, it is not trivial to exploit such latent variable models to capture the dependency among discrete words while also pursuing complex vision-language alignment in image captioning. In this paper, we break with the deeply rooted convention of learning a Transformer-based encoder-decoder and propose a new diffusion-model-based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net). Technically, for each input image, we first retrieve semantically relevant sentences via a cross-modal retrieval model to convey comprehensive semantic information. These rich semantics are then treated as a semantic prior to trigger the learning of a Diffusion Transformer, which produces the output sentence through a diffusion process. In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better vision-language alignment and linguistic coherence in a cascaded manner. Furthermore, to stabilize the diffusion process, a new self-critical sequence training strategy is designed to guide the learning of SCD-Net with the knowledge of a standard autoregressive Transformer model. Extensive experiments on the COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task. Source code is available at \url{https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet}.
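The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the retrieval model, so this example assumes that the image and the candidate sentences have already been embedded in a shared space and ranks sentences by cosine similarity, a common choice that is not confirmed for SCD-Net. The function names, the toy embeddings, and the value of `k` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_sentences(
    image_emb: Sequence[float],
    sentence_pool: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the k sentences whose embeddings are most similar to the image embedding."""
    ranked = sorted(
        sentence_pool,
        key=lambda item: cosine(image_emb, item[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]


# Toy data (assumption): hand-made 3-d embeddings standing in for a real
# cross-modal encoder.
pool = [
    ("a dog runs on grass", [0.9, 0.1, 0.0]),
    ("a plate of food", [0.0, 0.2, 0.9]),
    ("a dog catches a frisbee", [0.8, 0.3, 0.1]),
]
image_emb = [1.0, 0.2, 0.0]

# The retrieved sentences would then serve as the semantic prior that
# conditions the diffusion-based caption generator.
retrieved = retrieve_sentences(image_emb, pool, k=2)
print(retrieved)
```

With these toy vectors, the two dog-related sentences rank above the unrelated one, which is the behaviour the semantic prior relies on: the captioner is conditioned on text that is already close to the image content.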