We propose a novel approach for 3D video synthesis that is able to represent multi-view video recordings of a dynamic real-world scene in a compact, yet expressive representation that enables high-quality view synthesis and motion interpolation. Our approach takes the high quality and compactness of static neural radiance fields in a new direction: to a model-free, dynamic setting. At the core of our approach is a novel time-conditioned neural radiance field that represents scene dynamics using a set of compact latent codes. To exploit the fact that changes between adjacent frames of a video are typically small and locally consistent, we propose two novel strategies for efficient training of our neural network: 1) an efficient hierarchical training scheme, and 2) an importance sampling strategy that selects the next rays for training based on the temporal variation of the input videos. In combination, these two strategies significantly boost the training speed, lead to fast convergence of the training process, and enable high-quality results. Our learned representation is highly compact and able to represent a 10-second, 30 FPS multi-view video recording from 18 cameras with a model size of just 28MB. We demonstrate that our method can render high-fidelity wide-angle novel views at over 1K resolution, even for highly complex and dynamic scenes. We perform an extensive qualitative and quantitative evaluation that shows that our approach outperforms the current state of the art. We include additional video and information at: https://neural-3d-video.github.io/
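To make the core idea concrete, the sketch below shows one plausible form of a time-conditioned radiance field: a NeRF-style MLP that, in addition to a positionally encoded 3D location and view direction, receives a learned per-frame latent code drawn from an embedding table. This is a minimal illustration under assumed design choices; the class name `TimeConditionedNeRF`, the hidden sizes, and the latent dimension are all hypothetical and not taken from the paper.

```python
# Minimal sketch of a time-conditioned neural radiance field (PyTorch).
# One compact latent code per video frame conditions the MLP, so a single
# network can represent the whole dynamic sequence. All names and sizes
# here are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class TimeConditionedNeRF(nn.Module):
    def __init__(self, num_frames, latent_dim=64, pos_dim=63, dir_dim=27, hidden=256):
        super().__init__()
        # Learned per-frame latent codes, optimized jointly with the weights.
        self.latent_codes = nn.Embedding(num_frames, latent_dim)
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)      # view-independent density
        self.color_head = nn.Sequential(            # view-dependent color
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc, frame_idx):
        # pos_enc: (B, pos_dim) encoded positions; dir_enc: (B, dir_dim)
        # encoded view directions; frame_idx: (B,) integer frame indices.
        z = self.latent_codes(frame_idx)                       # (B, latent_dim)
        h = self.trunk(torch.cat([pos_enc, z], dim=-1))
        sigma = torch.relu(self.sigma_head(h))                 # (B, 1)
        rgb = self.color_head(torch.cat([h, dir_enc], dim=-1)) # (B, 3)
        return sigma, rgb
```

Here `pos_dim=63` and `dir_dim=27` correspond to standard NeRF positional encodings with 10 and 4 frequency bands, respectively; any other encoding would work as long as the input widths match.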
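The abstract's second strategy, importance sampling driven by temporal variation, can likewise be sketched. The heuristic below weights each pixel by its deviation from the temporal median image of a camera's video, so rays through moving regions are drawn more often. This is one plausible instantiation under assumed details: the function names, the median-residual weighting, and the floor `eps` are illustrative, not the paper's exact scheme.

```python
# Sketch of temporal-variation ray importance sampling (PyTorch).
# Pixels that change more over time receive higher sampling probability;
# the floor `eps` keeps static pixels reachable. Names and the exact
# weighting are assumptions for illustration only.
import torch

def ray_sampling_weights(frames, eps=0.05):
    """frames: (T, H, W, 3) video from one camera, values in [0, 1]."""
    # Per-pixel deviation from the temporal median image, used as a
    # proxy for motion; static regions get weight close to eps.
    median = frames.median(dim=0).values                   # (H, W, 3)
    residual = (frames - median).abs().mean(dim=(0, -1))   # (H, W)
    weights = residual.clamp(min=eps)
    return (weights / weights.sum()).flatten()             # pixel distribution

def sample_rays(frames, num_rays):
    """Draw pixel coordinates for the next training batch."""
    probs = ray_sampling_weights(frames)
    idx = torch.multinomial(probs, num_rays, replacement=False)
    W = frames.shape[2]
    return torch.stack((idx // W, idx % W), dim=-1)        # (num_rays, 2) rows/cols
```

In a training loop, the sampled pixel coordinates would be converted to camera rays and fed, together with the frame's latent code, to the time-conditioned field above.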