Scene graph generation has emerged as an important problem in computer vision. While scene graphs provide a grounded representation of objects, their locations, and their relations in an image, they do so only at the granularity of proposal bounding boxes. In this work, we propose the first, to our knowledge, framework for pixel-level segmentation-grounded scene graph generation. Our framework is agnostic to the underlying scene graph generation method and addresses the lack of segmentation annotations in target scene graph datasets (e.g., Visual Genome) through transfer and multi-task learning from, and with, an auxiliary dataset (e.g., MS COCO). Specifically, each detected target object is endowed with a segmentation mask, expressed as a lingual-similarity weighted linear combination over categories that have annotations in the auxiliary dataset. These inferred masks, along with a novel Gaussian attention mechanism that grounds the relations at the pixel level within the image, allow for improved relation prediction. The entire framework is end-to-end trainable and is learned in a multi-task manner with both target and auxiliary datasets.
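To make the mask-transfer idea above concrete, the following is a minimal sketch in our own notation (the symbols $\mathcal{A}$, $\mathbf{m}_a$, $w_{c,a}$, and the use of a softmax over word-embedding cosine similarities are illustrative assumptions, not details given in the abstract). The mask $\mathbf{m}_c$ for a target category $c$ without segmentation annotations could be formed as a similarity-weighted combination of masks predicted for annotated auxiliary categories:
\[
\mathbf{m}_c \;=\; \sum_{a \in \mathcal{A}} w_{c,a}\,\mathbf{m}_a,
\qquad
w_{c,a} \;=\; \frac{\exp\!\big(s(c,a)\big)}{\sum_{a' \in \mathcal{A}} \exp\!\big(s(c,a')\big)},
\]
where $\mathcal{A}$ is the set of auxiliary categories with segmentation annotations (e.g., from MS COCO), $\mathbf{m}_a$ is the mask predicted for auxiliary category $a$, and $s(c,a)$ is a lingual similarity between category names (e.g., cosine similarity of their word embeddings). Under this sketch, the weights sum to one, so an unannotated target class inherits most of its mask from the lingually closest auxiliary classes.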