We present a novel Tensor Composition Net (TCN) to predict visual relationships in images. Visual Relationship Prediction (VRP) provides a more challenging test of image understanding than conventional image tagging, and is difficult to learn due to its large label space and incomplete annotation. The key idea of our TCN is to exploit the low-rank property of the visual relationship tensor, leveraging correlations within and across objects and relations to make a structured prediction of all visual relationships in an image. To show the effectiveness of our model, we first empirically compare it with Multi-Label Image Classification (MLIC) methods, eXtreme Multi-label Classification (XMC) methods, and Visual Relationship Detection (VRD) methods. We then show that, thanks to our tensor (de)composition layer, our model can predict visual relationships that were not seen in the training dataset. Finally, we show that our TCN's image-level visual relationship prediction provides a simple and efficient mechanism for relation-based image retrieval, even compared with VRD methods.
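To make the low-rank idea concrete, here is a minimal sketch (not the paper's actual architecture; all names, sizes, and the rank are assumptions) of scoring every (subject, predicate, object) triple as a rank-r CP composition of three factor matrices, so that the full relationship score tensor is low-rank by construction:

```python
import numpy as np

# Hypothetical factor sizes: n_subj subjects, n_pred predicates,
# n_obj objects, and a small shared rank r (all illustrative choices).
rng = np.random.default_rng(0)
n_subj, n_pred, n_obj, r = 10, 6, 10, 4

U_s = rng.standard_normal((n_subj, r))  # subject factor matrix
U_p = rng.standard_normal((n_pred, r))  # predicate factor matrix
U_o = rng.standard_normal((n_obj, r))   # object factor matrix

# CP composition: T[i, j, k] = sum_c U_s[i, c] * U_p[j, c] * U_o[k, c],
# i.e. one score per (subject, predicate, object) relationship triple.
T = np.einsum('ic,jc,kc->ijk', U_s, U_p, U_o)

# The mode-1 unfolding of T has rank at most r, confirming that the
# composed relationship tensor is low-rank by construction.
rank_mode1 = np.linalg.matrix_rank(T.reshape(n_subj, -1))
print(T.shape, rank_mode1)
```

Because every triple's score is tied to shared per-entity factors, regularities within and across objects and relations are pooled, which is what allows scores for unseen (subject, predicate, object) combinations to be composed from factors learned on seen ones.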