Learning from image-text data has recently demonstrated success for many recognition tasks, yet it is currently limited to visual features or individual visual concepts such as objects. In this paper, we propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph. To bridge the gap between images and text, we leverage an off-the-shelf object detector to identify and localize object instances, match the labels of detected regions to concepts parsed from the captions, and thus create "pseudo" labels for learning scene graphs. Further, we design a Transformer-based model to predict these "pseudo" labels via a masked token prediction task. Learning from only image-sentence pairs, our model achieves a 30% relative gain over a recent method trained with human-annotated unlocalized scene graphs. Our model also shows strong results for weakly- and fully-supervised scene graph generation. In addition, we explore an open-vocabulary setting for detecting scene graphs, and present the first result for open-set scene graph generation. Our code is available at https://github.com/YiwuZhong/SGG_from_NLS.
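The following is a minimal sketch of the pseudo-label creation step described above, under stated assumptions: a detector output given as (box, class name) pairs, and relation triplets already parsed from a caption. The helper name `create_pseudo_labels` and the exact-string label matching are illustrative simplifications, not the authors' actual implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def create_pseudo_labels(
    detections: List[Tuple[Box, str]],            # localized regions with detector class names
    caption_triplets: List[Tuple[str, str, str]],  # (subject, predicate, object) parsed from a caption
) -> List[Tuple[Box, str, Box]]:
    """Ground caption triplets onto detected regions by matching labels."""
    pseudo_labels = []
    for subj, pred, obj in caption_triplets:
        # Match each parsed concept to detected regions whose class name agrees.
        subj_boxes = [box for box, name in detections if name == subj]
        obj_boxes = [box for box, name in detections if name == obj]
        # Every matched (subject box, predicate, object box) combination becomes
        # a "pseudo" relationship label for training the scene graph model.
        for sb in subj_boxes:
            for ob in obj_boxes:
                if sb != ob:
                    pseudo_labels.append((sb, pred, ob))
    return pseudo_labels


# Example: detections and triplets for an image captioned "a man riding a horse".
dets = [((10, 20, 120, 260), "man"), ((80, 150, 400, 380), "horse")]
triplets = [("man", "riding", "horse")]
print(create_pseudo_labels(dets, triplets))
```

These pseudo labels then serve as training targets for the Transformer-based model's masked token prediction task.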