Representing scenes at the granularity of objects is a prerequisite for scene understanding and decision making. We propose a novel approach for learning multi-object 3D scene representations from images. A recurrent encoder regresses a latent representation of the 3D shape, pose, and texture of each object from an input RGB image. The 3D shapes are represented continuously in function space as signed distance functions (SDFs), which we efficiently pre-train from example shapes in a supervised way. Using differentiable rendering, we then train our model in a self-supervised manner to decompose scenes from RGB-D images. Our approach learns to decompose images into the constituent objects of the scene and to infer their shape, pose, and texture from a single view. We evaluate the accuracy of our model in inferring the 3D scene layout and demonstrate its generative capabilities.