Visual Relation Detection (VRD) aims to detect relationships between objects for image understanding. Most existing VRD methods rely on thousands of training samples for each relationship to achieve satisfactory performance. Some recent works tackle this problem via few-shot learning with elaborately designed pipelines and pre-trained word vectors. However, the performance of existing few-shot VRD models is severely hampered by poor generalization capability, as they struggle to handle the vast semantic diversity of visual relationships. In contrast, humans can learn new relationships from just a few examples by drawing on their prior knowledge. Inspired by this, we devise a knowledge-augmented few-shot VRD framework that leverages both textual knowledge and visual relation knowledge to improve the generalization ability of few-shot VRD. The textual knowledge and visual relation knowledge are acquired from a pre-trained language model and an automatically constructed visual relation knowledge graph, respectively. We extensively validate the effectiveness of our framework. Experiments conducted on three benchmarks from the commonly used Visual Genome dataset show that our framework surpasses existing state-of-the-art models by a large margin.