We present a method to learn compositional multi-object dynamics models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks. NeRFs have become a popular choice for representing scenes due to their strong 3D prior. However, most NeRF approaches are trained on a single scene, representing the whole scene with a global model, which makes generalization to novel scenes containing different numbers of objects challenging. Instead, we present a compositional, object-centric auto-encoder framework that maps multiple views of the scene to a set of latent vectors representing each object separately. The latent vectors parameterize individual NeRFs from which the scene can be reconstructed. Based on those latent vectors, we train a graph neural network dynamics model in the latent space to achieve compositionality for dynamics prediction. A key feature of our approach is that the latent vectors are forced to encode 3D information through the NeRF decoder, which enables us to incorporate structural priors in learning the dynamics models, making long-term predictions more stable compared to several baselines. Simulated and real-world experiments show that our method can model and learn the dynamics of compositional scenes including rigid and deformable objects. Video: https://dannydriess.github.io/compnerfdyn/
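The latent-space dynamics model described above can be sketched as a message-passing step over a fully connected graph of per-object latent vectors. The sketch below is a minimal illustration, not the paper's implementation: all dimensions, the two-layer MLPs, and the randomly initialized weights are hypothetical stand-ins for trained components, and the NeRF encoder/decoder is omitted (the latents here are just random placeholders for what the implicit object encoder would produce).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): per-object latent dim D, hidden dim H.
D, H = 8, 16

def mlp(x, W1, b1, W2, b2):
    """Two-layer MLP with ReLU, used for both edge and node updates."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

# Randomly initialized weights stand in for trained parameters.
We1, be1 = rng.normal(size=(2 * D, H)), np.zeros(H)   # edge MLP
We2, be2 = rng.normal(size=(H, H)), np.zeros(H)
Wn1, bn1 = rng.normal(size=(D + H, H)), np.zeros(H)   # node MLP
Wn2, bn2 = rng.normal(size=(H, D)), np.zeros(D)

def gnn_dynamics_step(z):
    """One latent-space dynamics step over a fully connected object graph.

    z: (num_objects, D) array of per-object latent vectors (one per NeRF).
    Returns the predicted latent vectors at the next time step.
    """
    n = z.shape[0]
    # Compute an edge message for every ordered pair (i, j), i != j,
    # and aggregate incoming messages at each node by summation.
    agg = np.zeros((n, H))
    for i in range(n):
        for j in range(n):
            if i != j:
                agg[i] += mlp(np.concatenate([z[i], z[j]]), We1, be1, We2, be2)
    # Node update: residual prediction from each object's own latent
    # plus its aggregated messages.
    return z + mlp(np.concatenate([z, agg], axis=1), Wn1, bn1, Wn2, bn2)

# Roll out one prediction step for a 3-object scene.
z0 = rng.normal(size=(3, D))
z1 = gnn_dynamics_step(z0)
```

Because each edge and node update only ever sees a fixed number of latent vectors, the same weights apply to scenes with any number of objects, which is the sense in which the dynamics model is compositional.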