The challenge of dynamic view synthesis from monocular videos, i.e., synthesizing novel views at free viewpoints given a monocular video of a dynamic scene captured by a moving camera, mainly lies in accurately modeling the dynamic objects of a scene from limited 2D frames, each with a different timestamp and viewpoint. Existing methods usually rely on 2D optical flow and depth maps pre-computed by off-the-shelf methods to supervise the network, making them suffer from the inaccuracy of this pre-computed supervision and from the ambiguity of lifting 2D information to 3D. In this paper, we tackle this challenge in an unsupervised fashion. Specifically, we decouple the motion of dynamic objects into object motion and camera motion, regularized respectively by the proposed unsupervised surface consistency and patch-based multi-view constraints. The former enforces the 3D geometric surfaces of moving objects to be consistent over time, while the latter regularizes their appearance to be consistent across different viewpoints. Such a fine-grained motion formulation eases the learning difficulty for the network, enabling it to produce not only novel views of higher quality but also more accurate scene flows and depth than existing methods that require extra supervision. We will make the code publicly available.
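To make the two constraints concrete, below is a minimal PyTorch-style sketch of how a surface-consistency term and a patch-based multi-view term could be combined in an unsupervised training loop. The function names, tensor shapes, L1 loss forms, and the toy random inputs are all illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of the two unsupervised losses described in the abstract.
# Shapes, names, and loss forms are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def surface_consistency_loss(points_t, scene_flow, points_t1):
    """Hypothetical surface-consistency term: 3D surface points of a moving
    object at time t, advected by the predicted per-point scene flow, should
    coincide with the surface reconstructed at time t+1."""
    warped = points_t + scene_flow            # (N, 3) points moved to t+1
    return F.l1_loss(warped, points_t1)

def patch_multiview_loss(rendered_patches, reference_patches):
    """Hypothetical patch-based multi-view term: small patches rendered from
    a novel viewpoint should match the corresponding patches observed in the
    input view, enforcing appearance consistency across viewpoints."""
    return F.l1_loss(rendered_patches, reference_patches)

# Toy usage with random tensors standing in for network outputs.
N, P = 1024, 8                                 # number of points, patch size
points_t = torch.randn(N, 3)                   # surface points at time t
flow = (0.01 * torch.randn(N, 3)).requires_grad_(True)   # predicted object motion
points_t1 = points_t + 0.01 * torch.randn(N, 3)          # surface at t+1

patches_novel = torch.rand(32, 3, P, P).requires_grad_(True)  # rendered patches
patches_ref = torch.rand(32, 3, P, P)                         # observed patches

loss = surface_consistency_loss(points_t, flow, points_t1) \
     + patch_multiview_loss(patches_novel, patches_ref)
loss.backward()   # gradients flow back to the motion and rendering predictions
```

In a real system, `flow` and `patches_novel` would come from the networks predicting object motion and rendering novel views, so these losses would train those networks end to end without any pre-computed flow or depth supervision.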