SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Answering complex questions about images is an ambitious goal for machine intelligence, one that requires a joint understanding of images, text, and commonsense knowledge, as well as strong reasoning ability. Recently, multimodal Transformers have made great progress on the task of Visual Commonsense Reasoning (VCR) by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not exploit the rich structure of the scene or the interactions between objects, both of which are essential for answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs into commonsense reasoning. To exploit the scene graph structure at the model structure level, we propose a multihop graph transformer that regularizes attention interaction among hops. At the pre-training level, we propose a scene-graph-aware pre-training method that leverages the structural knowledge extracted from the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly-supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost over state-of-the-art methods and demonstrate the efficacy of each proposed component.
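The abstract does not spell out how attention is regularized among hops, so the following is only an illustrative sketch of one plausible reading: compute hop distances between scene graph nodes and let a node attend only to nodes within a fixed number of hops. The function names (`hop_distances`, `hop_attention_mask`), the toy graph, and the choice to treat relations as graph nodes are all assumptions made for illustration, not the authors' implementation.

```python
from collections import deque


def hop_distances(num_nodes, edges):
    """Return all-pairs hop counts for an undirected graph (None if unreachable)."""
    adjacency = [[] for _ in range(num_nodes)]
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)

    dist = [[None] * num_nodes for _ in range(num_nodes)]
    for source in range(num_nodes):
        # Breadth-first search from each source node.
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if dist[source][neighbor] is None:
                    dist[source][neighbor] = dist[source][node] + 1
                    queue.append(neighbor)
    return dist


def hop_attention_mask(dist, max_hops):
    """mask[i][j] is True when node i may attend to node j (within max_hops)."""
    return [[d is not None and d <= max_hops for d in row] for row in dist]


# Hypothetical scene graph, with relations modeled as nodes (an assumption):
# person(0) - holding(1) - cup(2) - on(3) - table(4)
nodes = ["person", "holding", "cup", "on", "table"]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

dist = hop_distances(len(nodes), edges)
mask = hop_attention_mask(dist, max_hops=2)

for name, row in zip(nodes, mask):
    print(name, row)
```

In a Transformer layer, a mask like this would typically be applied to the attention scores before the softmax, so that "person" can attend to "cup" (two hops away) but not directly to "table" (four hops away). Stacking layers with different `max_hops` values is one way such a scheme could widen the receptive field gradually.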
