A central challenge in 3D scene perception via inverse graphics is robustly modeling the gap between 3D graphics and real-world data. We propose a novel 3D Neural Embedding Likelihood (3DNEL) over RGB-D images to address this gap. 3DNEL uses neural embeddings to predict 2D-3D correspondences from RGB and combines this with depth in a principled manner. 3DNEL is trained entirely from synthetic images and generalizes to real-world data. To showcase this capability, we develop a multi-stage inverse graphics pipeline that uses 3DNEL for 6D object pose estimation from real RGB-D images. Our method outperforms the previous state-of-the-art in sim-to-real pose estimation on the YCB-Video dataset, and improves robustness, with significantly fewer large-error predictions. Unlike existing bottom-up, discriminative approaches that are specialized for pose estimation, 3DNEL adopts a probabilistic generative formulation that jointly models multi-object scenes. This generative formulation enables easy extension of 3DNEL to additional tasks like object and camera tracking from video, using principled inference in the same probabilistic model without task-specific retraining.