Reasoning about the relationships between object pairs in images is a crucial task for holistic scene understanding. Most existing works treat this task as a pure visual classification problem: each relationship or phrase is assigned a relation category based on the extracted visual features. However, each kind of relationship spans a wide variety of object combinations, and each pair of objects can have diverse interactions. Obtaining sufficient training samples for all possible relationship categories is therefore difficult and expensive. In this work, we propose a natural-language-guided framework to tackle this problem. We use a generic bi-directional recurrent neural network to predict the semantic connection between the participating objects in a relationship from the perspective of natural language. This simple method achieves state-of-the-art results on the Visual Relationship Detection (VRD) and Visual Genome datasets, especially when predicting unseen relationships (e.g., recall improved from 76.42% to 89.79% on the VRD zero-shot test set).
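To illustrate the core idea, the following is a minimal sketch of scoring predicate categories for a (subject, object) pair from word embeddings alone, using a bi-directional recurrent network. The vocabulary, embedding size, recurrent cell (a plain tanh RNN), and the number of predicate classes are all illustrative assumptions; the paper's actual architecture and parameters are not reproduced here.

```python
import numpy as np

# Hypothetical toy setup: 3-word vocabulary, 8-dim embeddings,
# 4 predicate classes. All weights are random for illustration.
rng = np.random.default_rng(0)
vocab = {"person": 0, "horse": 1, "dog": 2}
d = 8                                    # embedding / hidden size
E = rng.standard_normal((len(vocab), d)) # word embedding table

def rnn_pass(seq, Wx, Wh):
    """Run a simple tanh RNN over a list of embedding vectors."""
    h = np.zeros(d)
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

# Separate parameters for the forward and backward directions.
Wx_f, Wh_f = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Wx_b, Wh_b = rng.standard_normal((d, d)), rng.standard_normal((d, d))
W_out = rng.standard_normal((4, 2 * d))  # 4 hypothetical predicates

def predicate_probs(subj, obj):
    """Predict a distribution over predicate classes for a
    (subject, object) pair using a bi-directional RNN over the
    word embeddings of the two participating objects."""
    seq = [E[vocab[subj]], E[vocab[obj]]]
    h_fwd = rnn_pass(seq, Wx_f, Wh_f)        # subject -> object
    h_bwd = rnn_pass(seq[::-1], Wx_b, Wh_b)  # object -> subject
    logits = W_out @ np.concatenate([h_fwd, h_bwd])
    e = np.exp(logits - logits.max())
    return e / e.sum()                       # softmax over predicates

probs = predicate_probs("person", "horse")
print(probs.shape, float(probs.sum()))
```

Because the predicate is predicted from the language side rather than from visual features alone, the same scorer can be applied to subject-object pairs never seen together at training time, which is the regime where the reported zero-shot gains arise.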