Image captioning models generally lack the ability to take user interest into account, and usually default to global descriptions that try to balance readability and informativeness while avoiding information overload. VQA models, on the other hand, generally cannot produce long, descriptive answers, and expect the textual question to be quite precise. We present a method to control the concepts that an image caption should focus on, using an additional input called the guiding text, which refers to either groundable or ungroundable concepts in the image. Our model consists of a Transformer-based multimodal encoder that combines the guiding text with global and object-level image features to derive early-fusion representations from which the guided caption is generated. While models trained on Visual Genome data have the in-domain advantage of fitting well when guided with automatic object labels, we find that guided captioning models trained on Conceptual Captions generalize better to out-of-domain images and guiding texts. Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets, and that increased style diversity (even without increasing vocabulary size) is a key factor for improved performance.
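To make the described architecture concrete, below is a minimal PyTorch sketch of the early-fusion encoder-decoder: guiding-text token embeddings, a projected global image feature, and projected object-level features are fused into a single sequence, encoded jointly by a Transformer encoder, and attended to by a Transformer decoder that generates the guided caption. This is not the authors' implementation; all names, dimensions, and hyperparameters (e.g. GuidedCaptioner, d_model=512, 2048-d visual features) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GuidedCaptioner(nn.Module):
    """Hypothetical early-fusion guided captioner (illustrative sketch)."""

    def __init__(self, vocab_size=10000, d_model=512, n_heads=8,
                 n_layers=6, img_feat_dim=2048, obj_feat_dim=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Project visual features into the shared embedding space.
        self.global_proj = nn.Linear(img_feat_dim, d_model)
        self.object_proj = nn.Linear(obj_feat_dim, d_model)
        # Learned markers distinguishing the three modalities in the fused sequence.
        self.modality_emb = nn.Embedding(3, d_model)  # 0=text, 1=global, 2=objects
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, guide_ids, global_feat, obj_feats, caption_ids):
        # guide_ids:   (B, Tg)               guiding-text token ids
        # global_feat: (B, img_feat_dim)     whole-image feature
        # obj_feats:   (B, K, obj_feat_dim)  object-level features
        text = self.token_emb(guide_ids) + self.modality_emb.weight[0]
        glob = self.global_proj(global_feat).unsqueeze(1) + self.modality_emb.weight[1]
        objs = self.object_proj(obj_feats) + self.modality_emb.weight[2]
        # Early fusion: one sequence over all modalities, encoded jointly.
        memory = self.encoder(torch.cat([text, glob, objs], dim=1))
        # Teacher-forced decoding of the guided caption with a causal mask.
        tgt = self.token_emb(caption_ids)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(hidden)  # (B, Tc, vocab_size) caption logits


# Example usage with random inputs (2 images, 36 detected objects each):
model = GuidedCaptioner()
logits = model(torch.randint(0, 10000, (2, 5)),    # guiding text
               torch.randn(2, 2048),               # global image feature
               torch.randn(2, 36, 2048),           # object-level features
               torch.randint(0, 10000, (2, 12)))   # caption (teacher forcing)
```

In this sketch, per-modality marker embeddings stand in for whatever positional or segment encoding the actual model uses; the key point is that text and visual tokens interact through self-attention from the very first encoder layer, rather than being fused late.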