Paragraph-style image captions describe diverse aspects of an image, as opposed to the more common single-sentence captions, which provide only an abstract description of the image. These paragraph captions can hence contain substantial information about the image for tasks such as visual question answering. Moreover, this textual information is complementary to the visual information present in the image because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can be directly matched with the textual question and copied into the textual answer (i.e., via easier modality match). Hence, we propose a combined Visual and Textual Question Answering (VTQA) model which takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are first fused by cross-attention to extract related information (early fusion), then fused again in the form of consensus (late fusion), and finally expected answers are given an extra score to enhance their chance of selection (later fusion). Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions. Overall, our joint model, when trained on the Visual Genome dataset, significantly improves VQA performance over a strong baseline model.
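The three fusion stages named in the abstract can be illustrated with a toy sketch. The abstract does not give the exact operations, so everything here is an assumption made for illustration: dot-product attention stands in for the cross-attention, an element-wise product of the two answer distributions stands in for the consensus step, and a fixed additive `bonus` for candidates that appear in the caption stands in for the extra score. The function names (`cross_attend`, `answer_scores`, `vtqa_scores`) and all feature vectors are hypothetical.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query, features):
    # Early fusion (assumed form): weight each feature vector by its
    # dot-product similarity to the question, then take the weighted sum.
    weights = softmax([dot(query, f) for f in features])
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def answer_scores(summary, answer_embeddings):
    # Turn an attended summary vector into a distribution over candidates.
    return softmax([dot(summary, a) for a in answer_embeddings])


def vtqa_scores(question, image_feats, caption_feats,
                answers, answer_embeddings, caption_words, bonus=0.1):
    visual = answer_scores(cross_attend(question, image_feats), answer_embeddings)
    textual = answer_scores(cross_attend(question, caption_feats), answer_embeddings)
    # Late fusion (assumed form): the product favors answers on which
    # the visual and textual branches agree.
    consensus = [v * t for v, t in zip(visual, textual)]
    # Later fusion (assumed form): add a fixed bonus to candidates that
    # literally occur in the paragraph caption.
    return [s + (bonus if a in caption_words else 0.0)
            for s, a in zip(consensus, answers)]


# Toy inputs: 2-dimensional features, two candidate answers.
question = [1.0, 0.0]
image_feats = [[1.0, 0.0], [0.0, 1.0]]
caption_feats = [[0.9, 0.1], [0.1, 0.9]]
answers = ["dog", "cat"]
answer_embeddings = [[1.0, 0.0], [0.0, 1.0]]
caption_words = {"a", "dog", "runs"}

scores = vtqa_scores(question, image_feats, caption_feats,
                     answers, answer_embeddings, caption_words)
best = answers[scores.index(max(scores))]
print(best, scores)
```

With these toy inputs both branches already lean toward "dog", and the caption bonus widens the margin. In a trained model the attention, the consensus step, and the bonus would be learned or tuned rather than fixed as they are here.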