Given a monocular video, segmenting and decoupling dynamic objects while recovering the static environment is a widely studied problem in machine intelligence. Existing solutions usually approach this problem in the image domain, limiting their performance and understanding of the environment. We introduce Decoupled Dynamic Neural Radiance Field (D$^2$NeRF), a self-supervised approach that takes a monocular video and learns a 3D scene representation which decouples moving objects, including their shadows, from the static background. Our method represents the moving objects and the static background by two separate neural radiance fields, with only one allowing for temporal changes. A naive implementation of this approach leads to the dynamic component taking over the static one, as the representation of the former is inherently more general and prone to overfitting. To address this, we propose a novel loss to promote the correct separation of phenomena. We further propose a shadow field network to detect and decouple dynamically moving shadows. We introduce a new dataset containing various dynamic objects and shadows and demonstrate that our method can achieve better performance than state-of-the-art approaches in decoupling dynamic and static 3D objects, occlusion and shadow removal, and image segmentation of moving objects.
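The core rendering idea described above — compositing a static and a dynamic radiance field along each ray, with a shadow field attenuating only the static color — can be sketched in a few lines. This is an illustrative sketch, not the paper's actual implementation: the function name, the simple density-weighted color mixing, and the scalar shadow ratio are all assumptions for exposition.

```python
import numpy as np

def composite_two_fields(sigma_s, rgb_s, sigma_d, rgb_d, shadow, deltas):
    """Alpha-composite one ray's samples from a static and a dynamic
    radiance field (illustrative sketch; names and mixing rule are
    assumptions, not the paper's exact formulation).

    sigma_s, sigma_d : (N,) densities of the static / dynamic fields
    rgb_s, rgb_d     : (N, 3) colors of the static / dynamic fields
    shadow           : (N,) shadow ratio in [0, 1]; darkens static color
    deltas           : (N,) distances between adjacent ray samples
    """
    # The shadow field attenuates the static color only.
    rgb_s = (1.0 - shadow)[:, None] * rgb_s

    # Combined density, with colors mixed in proportion to density.
    sigma = sigma_s + sigma_d
    rgb = (sigma_s[:, None] * rgb_s + sigma_d[:, None] * rgb_d) \
          / np.maximum(sigma[:, None], 1e-10)

    # Standard NeRF quadrature: front-to-back alpha compositing.
    alpha = 1.0 - np.exp(-sigma * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)
```

Because the two fields share one compositing equation, either field alone can explain the observations — which is exactly why, without a separation loss, the more expressive dynamic field tends to absorb the static scene.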