Current captioning approaches describe images via black-box architectures whose behavior is hardly controllable or explainable from the outside. Since an image can be described in countless ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability. Given a control signal in the form of a sequence or set of image regions, we generate the corresponding caption through a recurrent architecture which predicts textual chunks explicitly grounded on regions, following the constraints of the given control. Experiments are conducted on Flickr30k Entities and on COCO Entities, an extended version of COCO in which we add grounding annotations collected in a semi-automatic manner. Results demonstrate that our method achieves state-of-the-art performance on controllable image captioning, in terms of both caption quality and diversity. Code will be made publicly available.
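As a rough illustration of this decoding scheme, the sketch below is not the authors' implementation but a minimal PyTorch-style analogue under stated assumptions: the class name RegionControlledDecoder, the 2048-d region features, and the sigmoid chunk-shift gate are all illustrative. It shows a recurrent step that conditions each word on the region currently selected by the control sequence, and uses a learned gate to decide when the current chunk ends and the decoder should shift to the next region.

```python
import torch
import torch.nn as nn

class RegionControlledDecoder(nn.Module):
    """Illustrative decoder: generates a caption chunk by chunk, each chunk
    explicitly grounded on the region currently selected by the control."""

    def __init__(self, vocab_size, region_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim + region_dim, hidden_dim)
        self.word_head = nn.Linear(hidden_dim, vocab_size)  # next-word logits
        self.shift_head = nn.Linear(hidden_dim, 1)          # chunk-shift gate

    def forward(self, words, regions):
        # words: (B, T) token ids; regions: (B, R, region_dim) control signal,
        # ordered as the regions should appear in the caption.
        B, T = words.shape
        h = regions.new_zeros(B, self.hidden_dim)
        c = regions.new_zeros(B, self.hidden_dim)
        ptr = torch.zeros(B, dtype=torch.long, device=words.device)
        batch_idx = torch.arange(B, device=words.device)
        word_logits, shift_logits = [], []
        for t in range(T):
            r_t = regions[batch_idx, ptr]  # region each sample is grounded on
            x_t = torch.cat([self.embed(words[:, t]), r_t], dim=-1)
            h, c = self.cell(x_t, (h, c))
            word_logits.append(self.word_head(h))
            gate = self.shift_head(h).squeeze(-1)
            shift_logits.append(gate)
            # Advance to the next control region when the gate fires (greedy
            # thresholding purely for illustration; in training the gate
            # would be supervised against chunk boundaries).
            ptr = torch.clamp(ptr + (torch.sigmoid(gate) > 0.5).long(),
                              max=regions.size(1) - 1)
        return torch.stack(word_logits, 1), torch.stack(shift_logits, 1)

# Tiny smoke test: 2 captions of 12 tokens, controlled by 4 regions each.
dec = RegionControlledDecoder(vocab_size=10000)
logits, gates = dec(torch.randint(0, 10000, (2, 12)), torch.randn(2, 4, 2048))
```

Because the control is an ordered sequence of regions, feeding the same image with a different region ordering or subset would, under this scheme, yield a different but still grounded caption, which is the source of the diversity the abstract claims.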