We present a new pipeline for holistic 3D scene understanding from a single image that predicts object shapes, object poses, and the scene layout. Because this problem is highly ill-posed, existing methods often estimate both shapes and layout inaccurately, especially in cluttered scenes where objects heavily occlude one another. We propose to address this challenge with recent deep implicit representations. We not only propose an image-based local structured implicit network to improve object shape estimation, but also refine the 3D object poses and scene layout via a novel implicit scene graph neural network that exploits the implicit local object features. We also propose a novel physical violation loss to avoid incorrect context between objects. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in object shape estimation, scene layout estimation, and 3D object detection.
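The abstract does not state the formula of the physical violation loss, so the following is only a minimal illustrative sketch of the general idea of penalizing objects that penetrate one another, not the authors' formulation. It assumes each object is described by an implicit signed distance function (negative inside the object) and uses analytic spheres as stand-ins for learned implicit shapes. All names (`sphere_sdf`, `sample_sphere_surface`, `make_object`, `violation_penalty`) are hypothetical.

```python
import math


def sphere_sdf(center, radius):
    """Return a signed distance function: negative inside the sphere, positive outside.

    Assumption: a sphere stands in for a learned implicit shape.
    """
    def sdf(p):
        return math.dist(p, center) - radius
    return sdf


def sample_sphere_surface(center, radius, n=200):
    """Deterministic, roughly uniform surface samples (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n          # height in (-1, 1)
        r = math.sqrt(max(0.0, 1.0 - z * z))   # ring radius at that height
        theta = golden_angle * i
        points.append((center[0] + radius * r * math.cos(theta),
                       center[1] + radius * r * math.sin(theta),
                       center[2] + radius * z))
    return points


def make_object(center, radius, n=200):
    """Bundle an object's surface samples with its implicit function."""
    return sample_sphere_surface(center, radius, n), sphere_sdf(center, radius)


def violation_penalty(objects):
    """Mean penetration depth of each object's surface samples into every other object.

    A sample contributes only when it lies inside another object (sdf < 0).
    """
    total, count = 0.0, 0
    for i, (points_i, _) in enumerate(objects):
        for j, (_, sdf_j) in enumerate(objects):
            if i == j:
                continue
            for p in points_i:
                total += max(0.0, -sdf_j(p))
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    apart = [make_object((0.0, 0.0, 0.0), 1.0), make_object((3.0, 0.0, 0.0), 1.0)]
    overlapping = [make_object((0.0, 0.0, 0.0), 1.0), make_object((1.0, 0.0, 0.0), 1.0)]
    print("separated objects:", violation_penalty(apart))
    print("intersecting objects:", violation_penalty(overlapping))
```

In this sketch the penalty is zero when the objects do not intersect and grows as they overlap, so minimizing it pushes the predicted object poses toward a physically plausible arrangement. In a learning setting the same quantity would be computed with differentiable tensor operations so that gradients can reach the pose parameters.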