A 3D scene consists of a set of objects, each with a shape and a layout giving its position in space. Understanding 3D scenes from 2D images is an important goal, with applications in robotics and graphics. While there have been recent advances in predicting 3D shape and layout from a single image, most approaches rely on 3D ground truth for training, which is expensive to collect at scale. We overcome this limitation and propose a method that learns to predict 3D shape and layout for objects without any ground-truth shape or layout information; instead, we rely on multi-view images with 2D supervision, which can more easily be collected at scale. Through extensive experiments on 3D Warehouse, Hypersim, and ScanNet, we demonstrate that our approach scales to large datasets of realistic images and compares favorably to methods relying on 3D ground truth. On Hypersim and ScanNet, where reliable 3D ground truth is not available, our approach outperforms supervised approaches trained on smaller and less diverse datasets.