Understanding visual relationships involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the (subj,obj) pair (both semantically and spatially) to predict predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships, but complicates learning since the semantic space of visual relationships is huge and the training data is limited, especially for long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a (subj,obj) pair. Then, we distill the knowledge into a deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model outperforms state-of-the-art methods significantly, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on the VRD zero-shot testing set).
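To make the linguistic-knowledge step concrete, the following is a minimal sketch of how the conditional distribution P(predicate | subj, obj) could be estimated from annotated or text-mined triples. The function name and data format are illustrative assumptions, not the paper's exact implementation.

```python
from collections import Counter, defaultdict

def build_predicate_priors(triples):
    """Estimate P(predicate | subj, obj) from (subj, pred, obj) string triples.

    `triples` is assumed to be mined either from training annotations
    (internal knowledge) or from parsed text such as Wikipedia
    (external knowledge).
    """
    pair_counts = Counter()    # occurrences of each (subj, obj) pair
    triple_counts = Counter()  # occurrences of each full (subj, pred, obj) triple
    for subj, pred, obj in triples:
        pair_counts[(subj, obj)] += 1
        triple_counts[(subj, pred, obj)] += 1

    priors = defaultdict(dict)  # (subj, obj) -> {pred: P(pred | subj, obj)}
    for (subj, pred, obj), n in triple_counts.items():
        priors[(subj, obj)][pred] = n / pair_counts[(subj, obj)]
    return priors

# Example: the pair (person, horse) strongly favors the predicate "ride".
annotations = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "next to", "horse"),
]
print(build_predicate_priors(annotations)[("person", "horse")])
# {'ride': 0.666..., 'next to': 0.333...}
```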
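The distillation step is described only at a high level above. As one way to picture it, the sketch below combines a standard supervised term with a soft term pulling the visual model's predicate distribution toward the linguistic prior. The function name, the weighting hyperparameters (`alpha`, `temperature`), and the use of PyTorch are assumptions for illustration, not the paper's exact teacher-student formulation.

```python
import torch
import torch.nn.functional as F

def linguistic_distillation_loss(student_logits, labels, prior_probs,
                                 alpha=0.5, temperature=1.0):
    """Fit ground-truth predicates while staying close to P(pred | subj, obj).

    student_logits: (batch, num_predicates) raw scores from the visual model.
    labels:         (batch,) ground-truth predicate indices.
    prior_probs:    (batch, num_predicates) linguistic prior for each (subj, obj) pair.
    """
    # Standard supervised term on the annotated predicates.
    ce = F.cross_entropy(student_logits, labels)
    # Soft regularization term: KL divergence from the linguistic prior
    # to the (temperature-softened) predicted distribution.
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_student, prior_probs, reduction="batchmean")
    return ce + alpha * kl
```

In this sketch the prior acts as a regularizer rather than a hard constraint, which is what lets the model generalize to (subj, obj, predicate) combinations that are rare or unseen in the visual training data.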