Given a set of images and their corresponding paragraph captions, a challenging task is to learn to generate a semantically coherent paragraph that describes the visual content of an image. Inspired by recent successes in integrating semantic topics into this task, this paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework, which couples a visual extractor with a deep topic model to guide the learning of a language model. To capture the correlations between image and text at multiple levels of abstraction and to learn semantic topics from images, we design a variational inference network that builds a mapping from image features to textual captions. To guide paragraph generation, the learned hierarchical topics and visual features are integrated into the language model, either a Long Short-Term Memory (LSTM) network or a Transformer, and jointly optimized. Experiments on public datasets demonstrate that the proposed models, which are competitive with many state-of-the-art approaches on standard evaluation metrics, can both distill interpretable multi-layer semantic topics and generate diverse and coherent captions. We release our code at https://github.com/DandanGuo1993/VTCM-based-image-paragraph-caption.git