Scene graphs are powerful representations that encode images into their abstract semantic elements, i.e., objects and their interactions, which facilitates visual comprehension and explainable reasoning. Commonsense knowledge graphs, on the other hand, are rich repositories that encode how the world is structured and how general concepts interact. In this paper, we present a unified formulation of these two constructs, in which a scene graph is seen as an image-conditioned instantiation of a commonsense knowledge graph. Based on this new perspective, we reformulate scene graph generation as the inference of a bridge between the scene and commonsense graphs, where each entity or predicate instance in the scene graph has to be linked to its corresponding entity or predicate class in the commonsense graph. To this end, we propose a heterogeneous graph inference framework that exploits the rich structure within the scene and commonsense graphs at the same time. Through extensive experiments, we show that the proposed method achieves significant improvement over the state of the art.
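To make the bridge formulation concrete, the following is a minimal illustrative sketch and not the method proposed in the paper. It replaces heterogeneous graph inference with a plain nearest-class lookup by cosine similarity, purely to show what "linking each instance to its class" produces. All names (`infer_bridge`, `entity_classes`, `predicate_classes`) and the toy embedding vectors are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def infer_bridge(instances, classes):
    """Link every scene-graph instance to its most similar commonsense class.

    Assumption: a nearest-class lookup stands in for the graph inference
    described in the abstract; it only illustrates the output format,
    a set of instance-to-class bridge edges.
    """
    bridge = {}
    for name, vec in instances.items():
        bridge[name] = max(classes, key=lambda c: cosine(vec, classes[c]))
    return bridge


# Hypothetical commonsense graph nodes: entity and predicate classes.
entity_classes = {"person": [1.0, 0.0, 0.0], "horse": [0.0, 1.0, 0.0]}
predicate_classes = {"riding": [0.0, 0.0, 1.0], "near": [0.5, 0.5, 0.0]}

# Hypothetical scene graph nodes: instances detected in one image.
entity_instances = {"e1": [0.9, 0.1, 0.0], "e2": [0.2, 0.8, 0.1]}
predicate_instances = {"p1": [0.1, 0.0, 0.9]}

entity_bridge = infer_bridge(entity_instances, entity_classes)
predicate_bridge = infer_bridge(predicate_instances, predicate_classes)
print(entity_bridge, predicate_bridge)
```

In this toy setup, classifying an entity or predicate instance amounts to choosing which class node its bridge edge points to, which is the sense in which the abstract treats scene graph generation as bridge inference.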