Representing scenes at the granularity of objects is a prerequisite for scene understanding and decision making. We propose PriSMONet, a novel approach based on Prior Shape knowledge for learning Multi-Object 3D scene decomposition and representations from single images. Our approach learns to decompose images of synthetic scenes with multiple objects on a planar surface into their constituent scene objects and to infer their 3D properties from a single view. A recurrent encoder regresses a latent representation of the 3D shape, pose, and texture of each object from an input RGB image. Using differentiable rendering, we train our model to decompose scenes from RGB-D images in a self-supervised way. The 3D shapes are represented continuously in function space as signed distance functions, which we pre-train from example shapes in a supervised way. These shape priors provide weak supervision signals to better condition the challenging overall learning task. We evaluate the accuracy of our model in inferring the 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out the benefits of the learned representation.
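To illustrate the idea of representing shape continuously in function space, the sketch below evaluates an analytic signed distance function for a sphere: negative values lie inside the surface, positive values outside, and the zero level set is the surface itself. This is only a hand-written stand-in for intuition; in PriSMONet the SDF is a learned neural decoder pre-trained from example shapes, and the function name and parameters here are illustrative assumptions.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance of 3D points to a sphere.

    Negative inside the surface, zero on it, positive outside --
    the convention commonly used for SDF shape representations.
    """
    return np.linalg.norm(points - center, axis=-1) - radius

# Query two points against a unit sphere at the origin.
pts = np.array([[0.0, 0.0, 0.0],   # sphere center -> inside
                [2.0, 0.0, 0.0]])  # one unit beyond the surface -> outside
d = sphere_sdf(pts, center=np.zeros(3), radius=1.0)
print(d)  # [-1.  1.]
```

Because the representation is a function rather than a fixed mesh or voxel grid, the surface can be queried at arbitrary resolution, which is what makes such shape priors convenient to combine with differentiable rendering.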