With increasing focus on augmented and virtual reality (XR) applications comes the demand for algorithms that can lift objects from images and videos into representations suitable for a wide variety of related 3D tasks. Large-scale deployment of XR devices and applications means that we cannot rely solely on supervised learning, as collecting and annotating data for the unlimited variety of objects in the real world is infeasible. We present a weakly supervised method that decomposes a single image of an object into shape (depth and normals), material (albedo, reflectivity and shininess) and global lighting parameters. For training, the method relies only on a rough initial shape estimate of the training objects to bootstrap the learning process. This shape supervision can come, for example, from a pretrained depth network or, more generically, from a traditional structure-from-motion pipeline. In our experiments, we show that the method can successfully de-render 2D images into a decomposed 3D representation and generalizes to unseen object categories. Since in-the-wild evaluation is difficult due to the lack of ground-truth data, we also introduce a photo-realistic synthetic test set that allows for quantitative evaluation.
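To make the factorization concrete, the sketch below shows how the predicted factors (normals, albedo, reflectivity, shininess, and a single directional light) could be recombined into an image with a standard Blinn-Phong shading model. This is an illustrative assumption for exposition only; the function name, parameters, and the choice of Blinn-Phong are hypothetical, and the paper's actual differentiable renderer may differ.

```python
import numpy as np

def shade(albedo, normals, light_dir, view_dir,
          ambient=0.2, reflectivity=0.5, shininess=32.0):
    """Recombine decomposed factors into a shaded image (Blinn-Phong sketch).

    albedo:  (H, W, 3) per-pixel base color in [0, 1]
    normals: (H, W, 3) per-pixel unit surface normals
    light_dir, view_dir: (3,) unit vectors pointing away from the surface
    ambient, reflectivity, shininess: global lighting/material scalars
    """
    # Lambertian diffuse term: cosine of angle between normal and light.
    diffuse = np.clip(normals @ light_dir, 0.0, None)              # (H, W)
    # Blinn-Phong specular term via the half-vector.
    half = light_dir + view_dir
    half = half / np.linalg.norm(half)
    specular = reflectivity * np.clip(normals @ half, 0.0, None) ** shininess
    shading = ambient + diffuse + specular                         # (H, W)
    return np.clip(albedo * shading[..., None], 0.0, 1.0)

# Toy example: a flat surface facing the camera, lit from the upper left.
H = W = 4
normals = np.tile(np.array([0.0, 0.0, 1.0]), (H, W, 1))
albedo = np.full((H, W, 3), 0.6)
light = np.array([-1.0, 1.0, 1.0]); light = light / np.linalg.norm(light)
view = np.array([0.0, 0.0, 1.0])
img = shade(albedo, normals, light, view)
```

De-rendering inverts this map: given only `img`, the network must recover `albedo`, `normals`, and the lighting parameters, which is what makes the problem ill-posed and the rough shape supervision useful for bootstrapping.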