Prior work in scene graph generation requires categorical supervision at the level of triplets: subjects and objects, and the predicates that relate them, with or without bounding box information. However, scene graph generation is a holistic task, so holistic, contextual supervision should intuitively improve performance. In this work, we explore how linguistic structures in captions can benefit scene graph generation. Our method captures the information that captions provide about relations between individual triplets, as well as context for subjects and objects (e.g., mentions of their visual properties). Captions are a weaker form of supervision than triplets in that the alignment between the exhaustive list of human-annotated subjects and objects in triplets and the nouns in captions is loose. However, given the large and diverse sources of multimodal data on the web (e.g., blog posts with images and captions), linguistic supervision is more scalable than crowdsourced triplets. We present extensive experimental comparisons against prior methods that leverage instance- and image-level supervision, and ablate our method to show the impact of leveraging phrasal and sequential context, as well as of techniques that improve the localization of subjects and objects.
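To make concrete what "linguistic structures in captions" can supply, the sketch below extracts rough (subject, predicate, object) triplets and adjectival context from a caption using an off-the-shelf dependency parser. This is an illustration of the general idea only, not the paper's pipeline; it assumes spaCy with its "en_core_web_sm" model installed, and the function names are hypothetical.

```python
# Illustrative sketch only (not the paper's method): parse a caption with
# spaCy and read triplets and noun attributes off the dependency tree.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def caption_triplets(caption: str):
    """Return rough (subject, predicate, object) tuples parsed from a caption."""
    doc = nlp(caption)
    triplets = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        # Direct objects: "a man riding a horse" -> (man, ride, horse)
        objects = [(token.lemma_, c) for c in token.children if c.dep_ == "dobj"]
        # Prepositional objects: "riding on the beach" -> (man, ride on, beach)
        for prep in (c for c in token.children if c.dep_ == "prep"):
            for pobj in (c for c in prep.children if c.dep_ == "pobj"):
                objects.append((f"{token.lemma_} {prep.text}", pobj))
        for subj in subjects:
            for pred, obj in objects:
                triplets.append((subj.lemma_, pred, obj.lemma_))
    return triplets

def noun_attributes(caption: str):
    """Adjectival modifiers: the extra context captions give subjects/objects."""
    doc = nlp(caption)
    return [(adj.lemma_, tok.lemma_) for tok in doc if tok.pos_ == "NOUN"
            for adj in tok.children if adj.dep_ == "amod"]

print(caption_triplets("A man is riding a horse on the beach."))
# e.g. [('man', 'ride', 'horse'), ('man', 'ride on', 'beach')]
print(noun_attributes("A young man is riding a brown horse."))
# e.g. [('young', 'man'), ('brown', 'horse')]
```

Note how loosely such parses align with exhaustive human-annotated triplets: a caption mentions only a few salient objects and relations, which is precisely why captions constitute weaker, but far more scalable, supervision.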