We present ObPose, an unsupervised object-centric generative model that learns to segment 3D objects from RGB-D video. Inspired by prior art in 2D representation learning, ObPose considers a factorised latent space, separately encoding object-wise location (where) and appearance (what) information. In particular, ObPose leverages an object's canonical pose, defined via a minimum volume principle, as a novel inductive bias for learning the where component. To achieve this, we propose an efficient, voxelised approximation approach to recover an object's shape directly from a neural radiance field (NeRF). As a consequence, ObPose models scenes as compositions of NeRFs representing individual objects. When evaluated on the YCB dataset for unsupervised scene segmentation, ObPose outperforms the current state of the art in 3D scene inference (ObSuRF) by a significant margin in segmentation quality, both for video inputs and for multi-view static scenes. In addition, the design choices made in the ObPose encoder are validated with relevant ablations.
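To make the two geometric ideas named above concrete, the following is a minimal NumPy sketch of (i) voxelising a NeRF density field into an approximate occupancy grid and (ii) selecting a canonical pose by a minimum-volume principle. All specifics here (the `nerf_density` callable, the alpha-compositing threshold, and the yaw-only rotation search) are illustrative assumptions, not ObPose's actual implementation.

```python
import numpy as np

def voxelise_density(nerf_density, bounds, resolution=32, threshold=0.5):
    """Query a NeRF density function on a regular grid and threshold it
    into a boolean occupancy volume (a voxelised shape approximation).

    `nerf_density` is a hypothetical callable mapping (N, 3) points to
    (N,) volume densities; `bounds` is an axis-aligned (lo, hi) box.
    """
    lo, hi = bounds
    axes = [np.linspace(lo[d], hi[d], resolution) for d in range(3)]
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    pts = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    sigma = nerf_density(pts)
    # Convert densities to per-voxel opacity over one voxel-sized step,
    # then threshold opacity into occupancy.
    step = (hi - lo).max() / resolution
    alpha = 1.0 - np.exp(-sigma * step)
    occupied = alpha.reshape(resolution, resolution, resolution) > threshold
    return pts.reshape(resolution, resolution, resolution, 3), occupied

def min_volume_yaw(points, occupied, n_angles=36):
    """Canonical pose by a minimum-volume principle: among candidate yaw
    rotations, pick the one whose rotated occupied voxels have the
    smallest axis-aligned bounding-box volume."""
    occ_pts = points[occupied]            # (N, 3) centres of occupied voxels
    occ_pts = occ_pts - occ_pts.mean(0)   # centre the shape
    best_angle, best_vol = 0.0, np.inf
    # The bounding-box volume under yaw is 90-degree periodic.
    for theta in np.linspace(0.0, np.pi / 2, n_angles):
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        extent = (occ_pts @ R.T).max(0) - (occ_pts @ R.T).min(0)
        vol = float(np.prod(extent))
        if vol < best_vol:
            best_angle, best_vol = theta, vol
    return best_angle, best_vol
```

Restricting the search to yaw is a simplification for readability; the same minimum-volume criterion extends to full 3D rotations by scoring a discretised set of candidate orientations.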
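For the claim that scenes are modelled as compositions of per-object NeRFs, a standard way to merge $K$ object radiance fields into a single scene field (used, e.g., by compositional models such as ObSuRF) is to sum the densities and mix the colours with density weights; whether ObPose uses exactly this weighting is an assumption here:

$$\sigma(\mathbf{x}) = \sum_{k=1}^{K} \sigma_k(\mathbf{x}), \qquad \mathbf{c}(\mathbf{x}) = \frac{\sum_{k=1}^{K} \sigma_k(\mathbf{x})\,\mathbf{c}_k(\mathbf{x})}{\sum_{k=1}^{K} \sigma_k(\mathbf{x})},$$

so that the composed field can be rendered with the usual NeRF volume-rendering integral, and each point's colour is dominated by whichever object is densest there.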