Holistic 3D scene understanding entails estimation of both layout configuration and object geometry in a 3D environment. Recent works have shown advances in 3D scene estimation from various input modalities (e.g., images, 3D scans) by leveraging 3D supervision (e.g., 3D bounding boxes or CAD models), for which collection at scale is expensive and often intractable. To address this shortcoming, we propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth. Instead, we rely on 2D supervision from multi-view RGB images. Our method represents a 3D scene as a latent vector, from which we can progressively decode a sequence of objects characterized by their class categories, 3D bounding boxes, and meshes. With our trained autoregressive decoder representing the scene prior, our method facilitates many downstream applications, including scene synthesis, interpolation, and single-view reconstruction. Experiments on 3D-FRONT and ScanNet show that our method outperforms the state of the art in single-view reconstruction, and achieves state-of-the-art results in scene synthesis against baselines that require 3D supervision.
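The autoregressive decoding described above can be illustrated with a minimal sketch: starting from a scene latent, the decoder repeatedly emits one object (class category plus 3D box parameters) and conditions the next step on what has been emitted so far, until a stop token is produced. All names, dimensions, and the linear/greedy decoding below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Toy autoregressive scene decoder: a latent z is progressively decoded
# into a sequence of (class, 3D box) objects. Weights are random
# placeholders; a real model would learn them (here, via 2D supervision).
rng = np.random.default_rng(0)
D, N_CLASSES, STOP = 16, 5, 5          # latent dim, object classes, stop id

W_cls = rng.normal(size=(D, N_CLASSES + 1))  # class head (+1 for stop)
W_box = rng.normal(size=(D, 6))              # box head: center (3) + size (3)
W_emb = rng.normal(size=(N_CLASSES + 1, D))  # feedback embedding per class

def decode_scene(z, max_objects=10):
    """Greedily decode a sequence of (class_id, box_params) from latent z."""
    objects, h = [], z.copy()
    for _ in range(max_objects):
        cls = int(np.argmax(h @ W_cls))  # pick the most likely next object
        if cls == STOP:                  # stop token ends the sequence
            break
        box = np.tanh(h @ W_box)         # bounded 3D box parameters
        objects.append((cls, box))
        h = h + W_emb[cls]               # condition on the emitted object
    return objects

scene = decode_scene(rng.normal(size=D))
```

Because each step conditions on previously emitted objects, the same decoder can serve the downstream uses the abstract lists: sampling a fresh latent gives scene synthesis, and inferring the latent from an image gives single-view reconstruction.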