We present ObPose, an unsupervised object-centric inference and generation model that learns 3D-structured latent representations from RGB-D scenes. Inspired by prior art in 2D representation learning, ObPose factorises the latent space, separately encoding object location (where) and appearance (what). ObPose further leverages an object's pose (i.e. location and orientation), defined via a minimum-volume principle, as a novel inductive bias for learning the where component. To achieve this, we propose an efficient, voxelised approximation approach that recovers the object shape directly from a neural radiance field (NeRF). As a consequence, ObPose models each scene as a composition of NeRFs, richly representing individual objects. To evaluate the quality of the learned representations, ObPose is evaluated quantitatively on the YCB and CLEVR datasets for unsupervised scene segmentation, outperforming the current state of the art in 3D scene inference (ObSuRF) by a significant margin. Generative results qualitatively demonstrate that the same ObPose model can both generate novel scenes and flexibly edit the objects in them. These capacities again reflect the quality of the learned latents and the benefits of disentangling the where and what components of a scene. Key design choices in the ObPose encoder are validated with ablations.