In this paper, we propose the semantic graph Transformer (SGT) for 3D scene graph generation. The task aims to parse a point cloud-based scene into a semantic structural graph, with the core challenge of modeling the complex global structure. Existing methods based on graph convolutional networks (GCNs) suffer from the over-smoothing dilemma and can only propagate information from a limited number of neighboring nodes. In contrast, our SGT uses Transformer layers as the base building block to enable global information passing, with two types of proposed Transformer layers tailored to the 3D scene graph generation task. Specifically, we introduce the graph embedding layer to best utilize the global information in graph edges while maintaining comparable computation costs. Additionally, we propose the semantic injection layer to leverage categorical text labels and visual object knowledge. We benchmark our SGT on the established 3DSSG benchmark and achieve a 35.9% absolute improvement in relationship prediction's R@50, and an 80.40% boost on the subset of complex scenes, over the state-of-the-art. Our analyses further show SGT's superiority in long-tailed and zero-shot scenarios. We will release the code and model.