The extraction of a scene graph, with objects as nodes and mutual relationships as edges, is the basis for a deep understanding of image content. Despite recent advances such as message passing and joint classification, the detection of visual relationships remains a challenging task due to the sub-optimal exploration of mutual interactions among visual objects. In this work, we propose a novel transformer formulation for scene graph generation and relation prediction. We leverage the encoder-decoder architecture of the transformer for rich feature embedding of nodes and edges. Specifically, we model node-to-node interaction with the self-attention of the transformer encoder and edge-to-node interaction with the cross-attention of the transformer decoder. Further, we introduce a novel positional embedding suited to handling edges in the decoder. Finally, our relation prediction module classifies the directed relation from the learned node and edge embeddings. We name this architecture the Relation Transformer Network (RTN). On the Visual Genome and GQA datasets, we achieve overall mean improvements of 4.85 and 3.1 percentage points, respectively, over state-of-the-art methods. Our experiments show that the Relation Transformer can efficiently model context across datasets with small-, medium-, and large-scale relation classification.
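A minimal PyTorch sketch of the encoder-decoder pipeline the abstract describes: self-attention over node features in the encoder, cross-attention from edge queries to node embeddings in the decoder, and a linear relation classifier on top of the learned edge embeddings. The class name, dimensions, the 51-way relation head, and the simplified edge positional embedding (a sum of learned embeddings of each directed edge's subject and object node indices) are illustrative assumptions, not the authors' exact design.

```python
# Sketch of an RTN-style encoder-decoder using standard PyTorch modules.
# All hyperparameters and the edge positional embedding are assumptions.
import torch
import torch.nn as nn

class RelationTransformerSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=3,
                 num_relations=51, max_nodes=64):
        super().__init__()
        # Encoder: node-to-node interaction via self-attention.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Decoder: edge-to-node interaction via cross-attention.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Hypothetical edge positional embedding: learned embeddings of the
        # subject and object node indices that define each directed edge.
        self.node_index_emb = nn.Embedding(max_nodes, d_model)
        # Relation classifier over the learned edge embeddings.
        self.rel_head = nn.Linear(d_model, num_relations)

    def forward(self, node_feats, edge_feats, edge_index):
        # node_feats: (B, N, d_model) region features of detected objects
        # edge_feats: (B, E, d_model) visual features of candidate object pairs
        # edge_index: (B, E, 2) subject/object node indices of each edge
        nodes = self.encoder(node_feats)                  # node embeddings
        pos = self.node_index_emb(edge_index).sum(dim=2)  # (B, E, d_model)
        edges = self.decoder(edge_feats + pos, nodes)     # edge embeddings
        return self.rel_head(edges)                       # relation logits

# Toy usage with random features and edge indices.
B, N, E, d = 2, 10, 20, 256
edge_index = torch.randint(0, N, (B, E, 2))
model = RelationTransformerSketch(d_model=d)
logits = model(torch.randn(B, N, d), torch.randn(B, E, d), edge_index)
print(logits.shape)  # torch.Size([2, 20, 51])
```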