Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships from images for visual understanding. Although recent works have made steady progress on SGG, they still suffer from long-tail distribution issues: tail predicates are more costly to train and harder to distinguish because they have far fewer annotations than frequent predicates. Existing re-balancing strategies try to handle this problem via prior rules, but they remain confined to pre-defined conditions and do not scale across models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, in which a visually-prompted language model is learned to generate diverse fine-grained predicates in a low-resource way. The proposed CaCao can be applied in a plug-and-play fashion and automatically strengthens existing SGG models to tackle the long-tail problem. Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), with which models can generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, our Epic achieves competitive performance on open-world predicate prediction.