State-of-the-art image captioners can generate accurate sentences that describe images in a sequence-to-sequence manner, but they do so without considering controllability or interpretability. This, however, falls short of making image captioning widely usable, since an image can be interpreted in countless ways depending on the target audience and the context at hand. Controllability is especially important when the image captioner serves different people who interpret images in different ways. In this paper, we introduce a novel framework for image captioning that generates diverse descriptions by capturing the co-dependence between Part-Of-Speech (POS) tags and semantics. Our model decouples the direct dependence between successive variables, which allows the decoder to exhaustively search through the latent Part-Of-Speech choices while keeping the decoding speed proportional to the size of the POS vocabulary. Given a control signal in the form of a sequence of Part-Of-Speech tags, we propose a method that generates captions through a Transformer network, predicting words conditioned on the input Part-Of-Speech tag sequence. Experiments on publicly available datasets show that our model significantly outperforms state-of-the-art methods at generating diverse image captions of high quality.
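To illustrate how captioning conditioned on a POS tag sequence could be wired up, the sketch below builds a minimal Transformer decoder in PyTorch that adds a POS-tag embedding to each word embedding before attending over image region features. All names, dimensions, and the additive fusion scheme are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch (assumptions, not the authors' code) of a POS-conditioned
# Transformer caption decoder: each decoding position is conditioned on the
# Part-Of-Speech tag planned for that position.
import torch
import torch.nn as nn

class POSConditionedCaptioner(nn.Module):
    def __init__(self, word_vocab=10000, pos_vocab=20, d_model=512,
                 nhead=8, num_layers=3, feat_dim=2048):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, d_model)
        self.pos_emb = nn.Embedding(pos_vocab, d_model)   # POS-tag control signal
        self.img_proj = nn.Linear(feat_dim, d_model)       # project image region features
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, word_vocab)

    def forward(self, img_feats, prev_words, pos_tags):
        # img_feats: (B, R, feat_dim); prev_words, pos_tags: (B, T)
        memory = self.img_proj(img_feats)
        # fuse the control signal: word embedding + POS-tag embedding per step
        tgt = self.word_emb(prev_words) + self.pos_emb(pos_tags)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        h = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(h)                                  # logits over the word vocabulary

# Usage with random tensors (2 images, 36 regions, 12 decoding steps)
model = POSConditionedCaptioner()
logits = model(torch.randn(2, 36, 2048),
               torch.randint(0, 10000, (2, 12)),
               torch.randint(0, 20, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

Feeding different POS tag sequences to the same image features steers the decoder toward different sentence structures, which is one plausible way to realize the controllable, diverse captioning described above.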