Existing research addresses scene graph generation (SGG) -- a critical technology for scene understanding in images -- from a detection perspective, i.e., objects are detected using bounding boxes, followed by the prediction of their pairwise relationships. We argue that this paradigm causes several problems that impede the progress of the field. For instance, bounding box-based labels in current datasets usually contain redundant classes like hairs, and leave out background information that is crucial to the understanding of context. In this work, we introduce panoptic scene graph generation (PSG), a new task that requires the model to generate a more comprehensive scene graph representation based on panoptic segmentations rather than rigid bounding boxes. A high-quality PSG dataset, which contains 49k well-annotated overlapping images from COCO and Visual Genome, is created for the community to track the task's progress. For benchmarking, we build four two-stage baselines, adapted from classic SGG methods, and two one-stage baselines, PSGTR and PSGFormer, both based on the efficient Transformer-based detector DETR. While PSGTR uses a set of queries to directly learn triplets, PSGFormer separately models objects and relations as queries from two Transformer decoders, followed by a prompting-like relation-object matching mechanism. In the end, we share insights on open challenges and future directions.
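To make the one-stage formulation concrete, below is a minimal, hypothetical sketch of the PSGTR idea: a fixed set of learnable queries is decoded against image features, and each query is read out as one (subject, relation, object) triplet. All module and head names here are illustrative assumptions, not the authors' implementation; the panoptic mask heads that PSGTR also predicts are omitted for brevity, and the class counts (133 objects, 56 relations, as in the PSG dataset) are configurable.

```python
# Hypothetical sketch of a triplet-query decoder in the spirit of PSGTR.
# Not the authors' code: names, sizes, and heads are illustrative only.
import torch
import torch.nn as nn

class TripletQueryDecoder(nn.Module):
    def __init__(self, num_queries=100, d_model=256,
                 num_obj_classes=133, num_rel_classes=56):
        super().__init__()
        # Each query slot will be decoded into one candidate triplet.
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # One classification head per triplet slot; "+1" is the
        # no-object / no-relation class used to pad unmatched queries.
        self.subj_head = nn.Linear(d_model, num_obj_classes + 1)
        self.rel_head = nn.Linear(d_model, num_rel_classes + 1)
        self.obj_head = nn.Linear(d_model, num_obj_classes + 1)

    def forward(self, image_features):
        # image_features: (batch, num_tokens, d_model) from a backbone encoder.
        b = image_features.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        h = self.decoder(q, image_features)  # (b, num_queries, d_model)
        return self.subj_head(h), self.rel_head(h), self.obj_head(h)

# Smoke test with dummy encoder output.
feats = torch.randn(2, 1024, 256)
subj_logits, rel_logits, obj_logits = TripletQueryDecoder()(feats)
print(subj_logits.shape, rel_logits.shape, obj_logits.shape)
```

PSGFormer, by contrast, would keep two such decoders -- one producing object queries, one producing relation queries -- and match relations to subject/object pairs afterwards via its prompting-like mechanism, rather than binding all three roles to a single query as above.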