Today's scene graph generation (SGG) task is still far from practical, mainly due to severe training bias, e.g., collapsing diverse human walk on / sit on / lay on beach into human on beach. Given such SGGs, downstream tasks such as VQA can hardly infer better scene structures than a mere bag of objects. However, debiasing in SGG is not trivial, because traditional debiasing methods cannot distinguish between good and bad bias, e.g., a good context prior (person read book rather than eat) versus a bad long-tailed bias (behind / in front of collapsed into near). In this paper, we present a novel SGG framework based on causal inference rather than conventional likelihood. We first build a causal graph for SGG and perform traditional biased training with the graph. Then, we propose to draw counterfactual causality from the trained graph to infer the effect of the bad bias, which should be removed. In particular, we use the Total Direct Effect (TDE) as the final predicate score for unbiased SGG. Note that our framework is agnostic to the underlying SGG model and can thus be widely applied by anyone in the community seeking unbiased predictions. Using the proposed Scene Graph Diagnosis toolkit on the SGG benchmark Visual Genome and several prevailing models, we observe significant improvements over previous state-of-the-art methods.
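The counterfactual idea behind TDE can be sketched with a toy linear model: score the predicate once with the full input (factual), once with the visual evidence wiped out (counterfactual, so only the context-driven bias remains), and take the difference. The function names and the additive toy model below are illustrative assumptions, not the paper's actual implementation:

```python
# Toy "model": predicate logits mix visual evidence with a context prior.
# In the real framework this would be a trained SGG model; here it is a
# hypothetical additive stand-in for illustration only.
def predicate_logits(visual, context):
    return [v + c for v, c in zip(visual, context)]

def tde_scores(visual, context):
    """Total Direct Effect sketch: factual minus counterfactual logits."""
    # Factual prediction from the full (biased) model.
    factual = predicate_logits(visual, context)
    # Counterfactual prediction: wipe the visual input (zeros here; the
    # paper intervenes on the causal graph), keeping the context, so the
    # output reflects only the long-tailed context bias.
    counterfactual = predicate_logits([0.0] * len(visual), context)
    # Subtracting removes the bad bias captured by the counterfactual.
    return [f - cf for f, cf in zip(factual, counterfactual)]

print(tde_scores([2.0, 0.5], [3.0, 3.0]))  # → [2.0, 0.5]
```

In this linear toy, TDE exactly recovers the visual evidence and cancels the shared context prior; in a real nonlinear SGG model the subtraction instead suppresses the portion of the score attributable to the bad bias alone.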