Previous work has demonstrated learning isolated 3D objects (voxel grids, point clouds, meshes, etc.) from 2D-only self-supervision. Here we set out to extend this to entire 3D scenes composed of multiple objects, including their locations, orientations, and types, as well as the scene's illumination. Once trained, the system can map arbitrary 2D images to 3D scene structure. We analyze why analysis-by-synthesis-style losses, which supervise 3D scene structure through differentiable rendering, are not practical: they almost always get stuck in local minima caused by visual ambiguities. This can be overcome by a novel form of training: we use an additional network to steer the optimization itself to explore the full gamut of possible solutions, i.e., to be curious, and hence to resolve those ambiguities and find workable minima. The resulting system converts 2D images of different virtual or real scenes into complete 3D scenes, learned only from 2D images of those scenes.
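The following is a minimal, self-contained sketch of the general idea described above, not the authors' implementation: an analysis-by-synthesis loss through a (toy) differentiable renderer, plus a separate "curiosity" network whose prediction error acts as an exploration bonus, pushing the scene-parameter optimization out of ambiguous local minima. All names (render, curiosity, beta) and the stand-in renderer are illustrative assumptions.

import torch
import torch.nn as nn

torch.manual_seed(0)
IMG_DIM, PARAM_DIM = 64, 8

# Stand-in for a differentiable renderer: maps scene parameters
# (object locations/orientations, illumination, ...) to a flat "image".
render_basis = torch.randn(PARAM_DIM, IMG_DIM)
def render(params):
    return torch.tanh(params @ render_basis)

# Curiosity network: learns to predict the rendering from the parameters.
# Its prediction error is large for parameter regions the optimizer has
# not yet visited, and shrinks as a region becomes familiar.
curiosity = nn.Sequential(
    nn.Linear(PARAM_DIM, 32), nn.ReLU(), nn.Linear(32, IMG_DIM)
)

target_img = render(torch.randn(PARAM_DIM))          # observed 2D image
params = torch.randn(PARAM_DIM, requires_grad=True)  # scene estimate to recover

opt_scene = torch.optim.Adam([params], lr=1e-2)
opt_cur = torch.optim.Adam(curiosity.parameters(), lr=1e-3)
beta = 0.1  # weight of the curiosity bonus (assumed hyper-parameter)

for step in range(2000):
    img = render(params)
    recon = (img - target_img).pow(2).mean()          # analysis-by-synthesis loss
    bonus = (curiosity(params) - img.detach()).pow(2).mean()  # high where unexplored

    # Scene update: fit the image while *seeking* states the curiosity
    # network cannot yet predict, i.e. explore past visual ambiguities.
    opt_scene.zero_grad()
    (recon - beta * bonus).backward()
    opt_scene.step()

    # Curiosity update (fresh forward pass): the predictor catches up,
    # so already-explored regions stop yielding a bonus.
    pred_err = (curiosity(params.detach()) - render(params).detach()).pow(2).mean()
    opt_cur.zero_grad()
    pred_err.backward()
    opt_cur.step()

In this toy setting the bonus term simply perturbs the descent direction toward novel parameter configurations; in the paper's setting, the steering network plays the analogous role of driving the optimizer across the full space of candidate scene explanations.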