Reconstructing a 3D object from a single image (RGB or depth) is a fundamental problem in visual scene understanding, yet it remains challenging due to its ill-posed nature and the complexity of real-world scenes. To address these challenges, we adopt a primitive-based representation for 3D objects and propose a two-stage graph network for primitive-based 3D object estimation, which consists of a sequential proposal module and a graph reasoning module. Given a 2D image, our proposal module first generates a sequence of 3D primitives from the input image with local feature attention. The graph reasoning module then performs joint reasoning on a primitive graph to capture the global shape context for each primitive. Such a framework is capable of incorporating rich geometric and semantic constraints during 3D structure recovery, producing 3D objects with more coherent structure even under challenging viewing conditions. We train the entire graph neural network with a stage-wise strategy and evaluate it on three benchmarks: Pix3D, ModelNet, and NYU Depth V2. Extensive experiments show that our approach outperforms the previous state of the art by a considerable margin.
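To make the two-stage design concrete, below is a minimal PyTorch sketch of the pipeline described above: a sequential proposal module that autoregressively emits primitive parameters from image features, followed by a graph reasoning module that refines them via message passing over a fully connected primitive graph. All module names, feature dimensions, the number of primitives, and the primitive parameterisation (here a 9-dimensional box: centre, size, orientation) are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch of the two-stage primitive estimation pipeline.
# Dimensions, primitive parameterisation, and module structure are assumptions.
import torch
import torch.nn as nn


class SequentialProposalModule(nn.Module):
    """Stage 1: autoregressively propose primitives from image features."""

    def __init__(self, feat_dim=256, prim_dim=9, max_prims=8):
        super().__init__()
        self.max_prims = max_prims
        # Toy image encoder standing in for a CNN backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.GRUCell(prim_dim, feat_dim)
        self.head = nn.Linear(feat_dim, prim_dim)

    def forward(self, image):
        h = self.encoder(image)                        # (B, feat_dim)
        prev = torch.zeros(image.size(0), self.head.out_features,
                           device=image.device)
        prims = []
        for _ in range(self.max_prims):
            h = self.rnn(prev, h)                      # condition on previous primitive
            prev = self.head(h)                        # parameters of the next primitive
            prims.append(prev)
        return torch.stack(prims, dim=1)               # (B, N, prim_dim)


class GraphReasoningModule(nn.Module):
    """Stage 2: message passing over a fully connected primitive graph."""

    def __init__(self, prim_dim=9, hidden=64, steps=3):
        super().__init__()
        self.steps = steps
        self.embed = nn.Linear(prim_dim, hidden)
        self.message = nn.Linear(2 * hidden, hidden)
        self.update = nn.GRUCell(hidden, hidden)
        self.refine = nn.Linear(hidden, prim_dim)

    def forward(self, prims):
        B, N, _ = prims.shape
        h = self.embed(prims)                          # (B, N, hidden)
        for _ in range(self.steps):
            # Aggregate messages from all other primitives (fully connected graph).
            hi = h.unsqueeze(2).expand(B, N, N, -1)
            hj = h.unsqueeze(1).expand(B, N, N, -1)
            m = torch.relu(self.message(torch.cat([hi, hj], dim=-1))).mean(dim=2)
            h = self.update(m.reshape(B * N, -1), h.reshape(B * N, -1)).view(B, N, -1)
        return prims + self.refine(h)                  # residual refinement of parameters


if __name__ == "__main__":
    image = torch.randn(2, 3, 64, 64)
    proposals = SequentialProposalModule()(image)
    refined = GraphReasoningModule()(proposals)
    print(refined.shape)                               # torch.Size([2, 8, 9])
```

The two modules are kept separable on purpose, which mirrors the stage-wise training strategy mentioned above: the proposal module can be trained first, and the graph reasoning module then refines its outputs.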