The goal of scene graph generation is to predict a graph from an input image, where nodes correspond to identified and localized objects and edges to the interaction predicates between them. Existing methods are trained in a fully supervised manner and focus on message passing mechanisms, loss functions, and/or bias mitigation. In this work we introduce a simple-yet-effective self-supervised relational alignment regularization designed to improve scene graph generation performance. The proposed alignment is general and can be combined with any existing scene graph generation framework, where it is trained alongside the original model's objective. The alignment is achieved through distillation, using an auxiliary relation prediction branch that mirrors and shares parameters with its supervised counterpart. In the auxiliary branch, relational input features are partially masked prior to message passing and predicate prediction. The predictions for the masked relations are then aligned with those of the supervised counterpart after message passing. We illustrate the effectiveness of this self-supervised relational alignment in conjunction with two scene graph generation architectures, SGTR and Neural Motifs, and show that in both cases we achieve significantly improved performance.
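To make the masking-and-distillation idea concrete, below is a minimal PyTorch-style sketch of how such an alignment loss could be computed. It assumes hypothetical module names (`message_passing`, `predict_predicates`) shared by the supervised and auxiliary branches, zero-masking of relational features, a KL-divergence alignment term, and a detached supervised branch; none of these specifics are prescribed by the abstract and they stand in for the details of the actual method.

```python
import torch
import torch.nn.functional as F


def relational_alignment_loss(model, rel_feats, mask_ratio=0.3):
    """Sketch of a self-supervised relational alignment term.

    `model.message_passing` and `model.predict_predicates` are hypothetical
    shared modules standing in for the framework's relation head.
    """
    # Supervised branch: full relational features.
    # Detaching the supervised branch is an assumption made for this sketch.
    with torch.no_grad():
        ctx = model.message_passing(rel_feats)
        target_logits = model.predict_predicates(ctx)

    # Auxiliary branch: partially mask relational input features
    # before message passing (same shared parameters).
    mask = torch.rand(rel_feats.size(0), device=rel_feats.device) < mask_ratio
    masked_feats = rel_feats.clone()
    masked_feats[mask] = 0.0  # zero-masking; a learned mask token is another option

    aux_ctx = model.message_passing(masked_feats)
    aux_logits = model.predict_predicates(aux_ctx)

    # Align predictions for the masked relations with the supervised branch.
    return F.kl_div(
        F.log_softmax(aux_logits[mask], dim=-1),
        F.softmax(target_logits[mask], dim=-1),
        reduction="batchmean",
    )
```

In training, such a term would simply be added to the original model's supervised objective, e.g. `loss = supervised_loss + lambda_align * relational_alignment_loss(model, rel_feats)`, with the weighting coefficient treated as a hyperparameter.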