Objects in a scene are not always related. One-stage scene graph generation approaches achieve high execution efficiency by inferring the relations between entity pairs from sparse proposal sets and a small number of queries. However, they focus only on the relation between the subject and the object in the triplet ⟨subject entity, predicate entity, object entity⟩, ignoring the relations between the subject and the predicate or between the predicate and the object, so the model lacks self-reasoning ability. In addition, the linguistic modality has been neglected in one-stage methods; mining linguistic knowledge is necessary to improve the model's reasoning ability. To address these shortcomings, a Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to endow the model with flexible self-reasoning ability. SrTR adopts an encoder-decoder architecture, and a self-reasoning decoder is developed to perform the three inferences over the triplet: s+o→p, s+p→o, and p+o→s. Inspired by large-scale pre-trained image-text foundation models, visual-linguistic prior knowledge is introduced, and a visual-linguistic alignment strategy is designed to project visual representations into a semantic space with prior knowledge to aid relational reasoning. Experiments on the Visual Genome dataset demonstrate the superiority and fast inference ability of the proposed method.
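To make the three-way self-reasoning concrete, the following is a minimal PyTorch sketch of how the inference directions s+o→p, s+p→o, and p+o→s could be realized over per-triplet query features. The feature dimension, the concatenation-based fusion, and the MLP heads (infer_p, infer_o, infer_s) are illustrative assumptions, not SrTR's actual decoder design.

```python
import torch
import torch.nn as nn

class SelfReasoningSketch(nn.Module):
    """Hedged sketch of three-way triplet reasoning: each component of
    (subject, predicate, object) is re-estimated from the other two.
    Layer sizes and fusion scheme are assumptions for illustration."""

    def __init__(self, dim: int = 256):
        super().__init__()
        def head():
            return nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.infer_p = head()  # s + o -> p
        self.infer_o = head()  # s + p -> o
        self.infer_s = head()  # p + o -> s

    def forward(self, s, p, o):
        # s, p, o: (num_triplets, dim) query features for each triplet role.
        p_hat = self.infer_p(torch.cat([s, o], dim=-1))
        o_hat = self.infer_o(torch.cat([s, p], dim=-1))
        s_hat = self.infer_s(torch.cat([p, o], dim=-1))
        return s_hat, p_hat, o_hat
```

In such a scheme the re-estimated features can be supervised against, or fused back into, the original role queries, which is one plausible way the mutual subject-predicate and predicate-object dependencies could be exploited.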
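The visual-linguistic alignment strategy can likewise be sketched: visual features are projected into the semantic space of a pre-trained image-text model and pulled toward the corresponding prior-knowledge text embeddings. The projection layer, the cosine-similarity objective, and the function name below are assumptions; the abstract does not specify SrTR's exact alignment loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def visual_linguistic_alignment_loss(visual_feats: torch.Tensor,
                                     text_embeds: torch.Tensor,
                                     proj: nn.Linear) -> torch.Tensor:
    """Hedged sketch: project visual features into a (frozen) text-embedding
    space from a pre-trained image-text foundation model and maximize the
    cosine similarity of matched pairs. `proj` is an assumed learnable map."""
    v = F.normalize(proj(visual_feats), dim=-1)  # project, then L2-normalize
    t = F.normalize(text_embeds, dim=-1)         # prior-knowledge embeddings
    # One minus cosine similarity of matched (visual, text) pairs.
    return (1.0 - (v * t).sum(dim=-1)).mean()
```

Under this reading, the frozen text embeddings carry the linguistic prior, while the projection is the only trainable part, so relational reasoning can draw on semantics the visual branch alone would miss.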