A main challenge in scene graph classification is that the appearance of objects and relations can differ significantly from one image to another. Previous works have addressed this through relational reasoning over all objects in an image, or by incorporating prior knowledge into the classifier. Unlike these works, we do not use separate models for perception and prior knowledge. Instead, we take a multi-task learning approach in which the classification is implemented as an attention layer. This allows the prior knowledge to emerge and propagate within the perception model. By forcing the model to also represent the prior, we obtain a strong inductive bias. We show that our model can accurately generate commonsense knowledge and that iteratively injecting this knowledge into the scene representations leads to significantly higher classification performance. Additionally, our model can be fine-tuned on external knowledge given as triples. Combined with self-supervised learning, this yields accurate predictions from only 1% of annotated images.