Scene graph generation (SGG) is a fundamental task that aims to detect visual relations between objects in an image. Prevailing SGG methods require all object classes to be given in the training set; such a closed setting limits their practical application. In this paper, we introduce open-vocabulary scene graph generation (Ov-SGG), a novel, realistic, and challenging setting in which a model is trained on a set of base object classes but is required to infer relations for unseen target object classes. To this end, we propose a two-step method that first pre-trains on large amounts of coarse-grained region-caption data and then leverages two prompt-based techniques to finetune the pre-trained model without updating its parameters. Moreover, our method can support inference over completely unseen object classes, which existing methods are incapable of handling. In extensive experiments on three benchmark datasets, Visual Genome, GQA, and Open-Image, our method significantly outperforms recent, strong SGG methods in the Ov-SGG setting as well as in the conventional closed SGG setting.
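The core mechanism summarized above, finetuning via learnable prompts while the pre-trained model's parameters stay frozen, can be illustrated with a minimal sketch. The snippet below is not the paper's implementation: the small transformer encoder stands in for the model pre-trained on region-caption data, the abstract does not specify the two prompt-based techniques, and all names (`PromptedRelationClassifier`, `num_prompts`, the light predicate head, which is also trained here) are hypothetical.

```python
import torch
import torch.nn as nn

class PromptedRelationClassifier(nn.Module):
    """Toy prompt-tuning setup: learnable prompt vectors are prepended to
    region features and passed through a frozen pre-trained encoder."""

    def __init__(self, encoder: nn.Module, embed_dim: int,
                 num_prompts: int, num_predicates: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # pre-trained weights are never updated
        # Only the prompts (and a light head, an assumption of this sketch) train.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, num_predicates)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, num_regions, embed_dim) features of object pairs
        batch = region_feats.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        tokens = torch.cat([prompts, region_feats], dim=1)
        encoded = self.encoder(tokens)          # frozen encoder, gradients flow
        return self.head(encoded.mean(dim=1))   # predicate logits

# Usage: a 2-layer transformer plays the role of the pre-trained model.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
model = PromptedRelationClassifier(nn.TransformerEncoder(layer, num_layers=2),
                                   embed_dim=256, num_prompts=8, num_predicates=50)
logits = model(torch.randn(2, 6, 256))
print(logits.shape)  # torch.Size([2, 50])
# Only prompt/head parameters are trainable; the encoder stays frozen.
print([n for n, p in model.named_parameters() if p.requires_grad])
```

Because gradients reach only the prompt vectors and the head, the optimizer touches a tiny fraction of the parameters, which is what allows the pre-trained model to be adapted to unseen object classes without being overwritten.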