The task of scene graph generation entails identifying object entities and their corresponding interaction predicates in a given image (or video). Due to the combinatorially large solution space, existing approaches to scene graph generation assume a certain factorization of the joint distribution to make estimation feasible (e.g., assuming that object predictions are conditionally independent of predicate predictions). However, such a fixed factorization is not ideal in all scenarios (e.g., for images where an object involved in an interaction is small and not discernible on its own). In this work, we propose a novel framework for scene graph generation that addresses this limitation and introduces dynamic conditioning on the image, using message passing in a Markov Random Field. This is implemented as an iterative refinement procedure wherein each modification is conditioned on the graph generated in the previous iteration. This conditioning across refinement steps allows joint reasoning over entities and relations. The framework is realized via a novel, end-to-end trainable, transformer-based architecture. In addition, the proposed framework can improve the performance of existing approaches. Through extensive experiments on the Visual Genome and Action Genome benchmark datasets, we show improved performance on the scene graph generation task.
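To make the iterative refinement idea concrete, below is a minimal, hypothetical PyTorch sketch of the general pattern: scene-graph estimates from iteration t-1 condition the update at iteration t, enabling joint reasoning over entities and predicates. This is an illustration under assumed design choices, not the paper's actual architecture; all module and parameter names (e.g., `IterativeSceneGraphRefiner`, `obj_feedback`) are invented for this sketch.

```python
# Hypothetical sketch of iterative scene-graph refinement with a transformer.
# Each iteration conditions on the graph predicted in the previous iteration.
import torch
import torch.nn as nn

class IterativeSceneGraphRefiner(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_iters=3,
                 num_obj_classes=150, num_pred_classes=50):
        super().__init__()
        self.num_iters = num_iters
        # One transformer layer applied repeatedly; weights could also be
        # untied across iterations.
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                                batch_first=True)
        # Heads that read out object / predicate class distributions.
        self.obj_head = nn.Linear(dim, num_obj_classes)
        self.pred_head = nn.Linear(dim, num_pred_classes)
        # Project previous-iteration predictions back into the token space
        # so the next refinement step is conditioned on them.
        self.obj_feedback = nn.Linear(num_obj_classes, dim)
        self.pred_feedback = nn.Linear(num_pred_classes, dim)

    def forward(self, obj_tokens, pred_tokens, image_tokens):
        # obj_tokens:   (B, N_obj, dim)  entity queries
        # pred_tokens:  (B, N_pred, dim) predicate queries
        # image_tokens: (B, N_img, dim)  visual context from a backbone
        outputs = []
        for _ in range(self.num_iters):
            # Joint message passing over entities, predicates, and image.
            tokens = torch.cat([obj_tokens, pred_tokens, image_tokens], dim=1)
            tokens = self.layer(tokens)
            n_obj, n_pred = obj_tokens.size(1), pred_tokens.size(1)
            obj_tokens = tokens[:, :n_obj]
            pred_tokens = tokens[:, n_obj:n_obj + n_pred]
            image_tokens = tokens[:, n_obj + n_pred:]
            # Read out the current scene-graph estimate.
            obj_logits = self.obj_head(obj_tokens)
            pred_logits = self.pred_head(pred_tokens)
            outputs.append((obj_logits, pred_logits))
            # Condition the next iteration on this estimate.
            obj_tokens = obj_tokens + self.obj_feedback(obj_logits.softmax(-1))
            pred_tokens = pred_tokens + self.pred_feedback(pred_logits.softmax(-1))
        # Per-iteration predictions (e.g., for deep supervision on each step).
        return outputs
```

Returning per-iteration predictions mirrors the refinement view described above: each step's graph is both an output and the conditioning input for the next step.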