Relation prediction among entities in images is an important step in scene graph generation (SGG), which in turn impacts various visual understanding and reasoning tasks. Existing SGG frameworks, however, require heavy training yet are incapable of modeling unseen (i.e., zero-shot) triplets. In this work, we argue that this limitation stems from the lack of commonsense reasoning, i.e., the ability to associate similar entities and infer similar relations based on a general understanding of the world. To fill this gap, we propose CommOnsense-integrAted sCene grapH rElation pRediction (COACHER), a framework that integrates commonsense knowledge into SGG, especially for zero-shot relation prediction. Specifically, we develop novel graph mining pipelines to model the neighborhoods and paths around entities in an external commonsense knowledge graph, and integrate them on top of state-of-the-art SGG frameworks. Extensive quantitative evaluations and qualitative case studies on both original and manipulated datasets from Visual Genome demonstrate the effectiveness of the proposed approach.