Different objects in the same scene are more or less related to each other, but only a limited number of these relationships are noteworthy. Inspired by DETR, which excels in object detection, we view scene graph generation as a set prediction problem and propose RelTR, an end-to-end scene graph generation model with an encoder-decoder architecture. The encoder reasons about the visual feature context, while the decoder infers a fixed-size set of subject-predicate-object triplets using different types of attention mechanisms with coupled subject and object queries. We design a set prediction loss that performs matching between ground-truth and predicted triplets for end-to-end training. In contrast to most existing scene graph generation methods, RelTR is a one-stage method that directly predicts a set of relationships using only visual appearance, without combining entities and labeling all possible predicates. Extensive experiments on the Visual Genome and Open Images V6 datasets demonstrate the superior performance and fast inference of our model.
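To make the set prediction loss concrete, below is a minimal sketch of the bipartite matching step it relies on: each predicted triplet is assigned to at most one ground-truth triplet via the Hungarian algorithm before the loss is computed. The cost here uses only classification probabilities for simplicity; the paper's actual matching cost also involves localization terms, and all function and variable names in this sketch are hypothetical.

```python
# Minimal sketch of set-based triplet matching, assuming a simplified
# classification-only cost (RelTR's real cost also includes box terms,
# in the spirit of DETR). Names here are illustrative, not the paper's API.
import torch
from scipy.optimize import linear_sum_assignment

def match_triplets(sub_logits, prd_logits, obj_logits, gt_sub, gt_prd, gt_obj):
    """Hungarian matching between N predicted and M ground-truth triplets.

    sub_logits, prd_logits, obj_logits: [N, num_classes] prediction logits
        for the subject, predicate, and object of each predicted triplet.
    gt_sub, gt_prd, gt_obj: [M] ground-truth class indices.
    Returns (pred_idx, gt_idx): index arrays of the optimal one-to-one
    assignment between predictions and ground-truth triplets.
    """
    sub_prob = sub_logits.softmax(-1)  # [N, C_entity]
    prd_prob = prd_logits.softmax(-1)  # [N, C_predicate]
    obj_prob = obj_logits.softmax(-1)  # [N, C_entity]
    # A prediction that assigns high probability to a ground-truth triplet's
    # subject, predicate, and object classes should have low matching cost.
    cost = -(sub_prob[:, gt_sub] + prd_prob[:, gt_prd] + obj_prob[:, gt_obj])
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx
```

Because the predicted set has a fixed size that typically exceeds the number of ground-truth triplets, unmatched predictions are supervised toward a "no relationship" background class, which is what lets the model output a variable number of relationships without a separate pruning stage.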