In this paper, we target the problem of learning a generalizable dynamic radiance field from monocular videos. Unlike most existing NeRF methods that rely on multiple views, monocular videos contain only one view at each timestamp, and therefore suffer from ambiguity along the view direction when estimating point features and scene flows. Previous studies such as DynNeRF disambiguate point features by positional encoding, which is not transferable and severely limits generalization. As a result, these methods must train an independent model for each scene and incur heavy computational costs when applied to a growing number of monocular videos in real-world applications. To address this, we propose MonoNeRF, which simultaneously learns point features and scene flows under point-trajectory and feature-correspondence constraints across frames. More specifically, we learn an implicit velocity field to estimate point trajectories from temporal features with a Neural ODE, followed by a flow-based feature aggregation module that gathers spatial features along the point trajectory. We jointly optimize the temporal and spatial features by training the network end to end. Experiments show that MonoNeRF can learn from multiple scenes and supports new applications such as scene editing, unseen-frame synthesis, and fast novel-scene adaptation.
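To make the trajectory-estimation idea concrete, the following is a minimal sketch, not the authors' implementation: an implicit velocity field MLP whose integral gives a point trajectory, followed by a simple aggregation of per-frame features along that trajectory. Module names, dimensions, the Euler integrator (a Neural-ODE solver would be used in the paper's formulation), and the stand-in frame features are all assumptions made for illustration.

```python
# Minimal sketch (assumed names and shapes, not the authors' code): an implicit
# velocity field integrated with explicit Euler steps to approximate the
# Neural-ODE point trajectory, then a simple feature aggregation along it.
import torch
import torch.nn as nn


class VelocityField(nn.Module):
    """Predicts a 3D velocity for a point at (x, y, z, t)."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, points: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # points: (N, 3), t: (N, 1) -> velocity: (N, 3)
        return self.mlp(torch.cat([points, t], dim=-1))


def trace_trajectory(field: VelocityField, points: torch.Tensor,
                     t0: float, t1: float, steps: int = 8) -> torch.Tensor:
    """Integrate points from time t0 to t1; returns (steps + 1, N, 3).

    An ODE solver (e.g. torchdiffeq.odeint) would replace these Euler steps in
    a true Neural-ODE setup; Euler is used here only to keep the sketch
    self-contained.
    """
    dt = (t1 - t0) / steps
    traj = [points]
    t = torch.full((points.shape[0], 1), t0)
    for _ in range(steps):
        points = points + dt * field(points, t)
        t = t + dt
        traj.append(points)
    return torch.stack(traj, dim=0)


if __name__ == "__main__":
    field = VelocityField()
    pts = torch.randn(1024, 3)            # sampled ray points at time t0
    traj = trace_trajectory(field, pts, 0.0, 1.0)
    # Placeholder for flow-based aggregation: average per-frame features
    # sampled along the trajectory (random tensors stand in for the image
    # features the paper would look up at the warped point locations).
    frame_feats = torch.randn(traj.shape[0], traj.shape[1], 64)
    aggregated = frame_feats.mean(dim=0)  # (N, 64) spatial feature per point
    print(traj.shape, aggregated.shape)
```

In this sketch the aggregation is a plain mean over time steps; the paper's flow-based module would instead weight features by correspondence quality along the estimated trajectory.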