We extend neural radiance fields (NeRFs) to dynamic large-scale urban scenes. Prior work tends to reconstruct single video clips of short duration (up to 10 seconds), for two reasons: such methods (a) tend to scale linearly with the number of moving objects and input videos, because a separate model is built for each, and (b) tend to require supervision via 3D bounding boxes and panoptic labels, obtained manually or via category-specific models. As a step towards truly open-world reconstructions of dynamic cities, we introduce two key innovations: (a) we factorize the scene into three separate hash-table data structures that efficiently encode static, dynamic, and far-field radiance fields, and (b) we make use of unlabeled target signals consisting of RGB images, sparse LiDAR, off-the-shelf self-supervised 2D descriptors, and, most importantly, 2D optical flow. Operationalizing such inputs via photometric, geometric, and feature-metric reconstruction losses enables SUDS (Scalable Urban Dynamic Scenes) to decompose dynamic scenes into the static background, individual objects, and their motions. When combined with our multi-branch hash-table representation, such reconstructions can be scaled to tens of thousands of objects across 1.2 million frames from 1700 videos spanning geospatial footprints of hundreds of kilometers; to our knowledge, this is the largest dynamic NeRF built to date. We present initial qualitative results on a variety of tasks enabled by our representations, including novel-view synthesis of dynamic urban scenes, unsupervised 3D instance segmentation, and unsupervised 3D cuboid detection. To compare with prior work, we also evaluate on KITTI and Virtual KITTI 2, surpassing state-of-the-art methods that rely on ground-truth 3D bounding box annotations while being 10x faster to train.
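To make the factorized representation concrete, the following is a minimal PyTorch sketch (our illustration, not the released SUDS code) of three separate hash-table encodings: a static branch keyed on position, a dynamic branch keyed on spacetime, and a far-field branch keyed on ray direction. The single-level, nearest-corner lookup and all hyperparameters (table size, feature width, resolution) are simplifying assumptions; practical Instant-NGP-style encodings use multiple resolution levels with trilinear interpolation and feed the looked-up features to small MLP heads that predict density and color.

```python
import torch
import torch.nn as nn

class HashGrid(nn.Module):
    """Simplified single-level spatial hash encoding (Instant-NGP style)."""
    def __init__(self, in_dim=3, num_entries=2**19, feat_dim=2, resolution=128):
        super().__init__()
        self.table = nn.Parameter(torch.randn(num_entries, feat_dim) * 1e-4)
        self.resolution = resolution
        # One large prime per input dimension for XOR-based spatial hashing.
        primes = torch.tensor([1, 2654435761, 805459861, 3674653429][:in_dim])
        self.register_buffer("primes", primes)

    def forward(self, x):                   # x in [0, 1]^in_dim, shape (N, in_dim)
        idx = (x.clamp(0.0, 1.0) * (self.resolution - 1)).long()
        h = idx * self.primes               # per-dimension contribution
        key = h[:, 0]
        for d in range(1, h.shape[1]):      # XOR-combine across dimensions
            key = key ^ h[:, d]
        return self.table[key % self.table.shape[0]]   # (N, feat_dim)

class SUDSBranches(nn.Module):
    """Three-way factorization: static f(x), dynamic f(x, t), far field f(d)."""
    def __init__(self):
        super().__init__()
        self.static  = HashGrid(in_dim=3)   # static background geometry/appearance
        self.dynamic = HashGrid(in_dim=4)   # moving objects, keyed on (x, y, z, t)
        self.far     = HashGrid(in_dim=3)   # sky/far field, keyed on ray direction

    def forward(self, xyz, t, direction):
        # xyz in [0, 1]^3, t in [0, 1], direction a unit vector in [-1, 1]^3.
        xyzt = torch.cat([xyz, t[:, None]], dim=-1)
        return (self.static(xyz),
                self.dynamic(xyzt),
                self.far(direction * 0.5 + 0.5))

# Usage on a random batch of 1024 samples:
field = SUDSBranches()
xyz, t = torch.rand(1024, 3), torch.rand(1024)
d = torch.nn.functional.normalize(torch.randn(1024, 3), dim=-1)
static_feat, dynamic_feat, far_feat = field(xyz, t, d)
```

Keeping the three tables separate lets the static branch remain time-independent, so only moving content pays the cost of the fourth (time) input.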
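The training signals can be sketched in the same spirit. Below is a hypothetical combination of the photometric, geometric (sparse LiDAR depth), feature-metric (self-supervised 2D descriptor), and optical-flow reconstruction losses; the tensor layout, the LiDAR mask, and the weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def suds_losses(pred, target, w_rgb=1.0, w_depth=0.1, w_feat=0.5, w_flow=0.1):
    """Combine the four unlabeled target signals into one training objective.

    pred and target are dicts of per-ray tensors with matching keys:
      rgb   : (N, 3) rendered vs. observed colors
      depth : (N,)   rendered expected depth vs. sparse LiDAR depth (0 = no return)
      feat  : (N, D) rendered vs. off-the-shelf 2D descriptors
      flow  : (N, 2) projected scene flow vs. observed 2D optical flow
    """
    losses = {"rgb": w_rgb * F.mse_loss(pred["rgb"], target["rgb"])}

    # Geometric loss: supervise only rays with an actual LiDAR return.
    mask = target["depth"] > 0
    if mask.any():
        losses["depth"] = w_depth * F.l1_loss(pred["depth"][mask],
                                              target["depth"][mask])

    # Feature-metric loss against self-supervised descriptors (e.g. DINO).
    losses["feat"] = w_feat * F.mse_loss(pred["feat"], target["feat"])

    # Flow loss: 3D motion predicted by the dynamic branch, projected to 2D,
    # should agree with off-the-shelf optical flow.
    losses["flow"] = w_flow * F.l1_loss(pred["flow"], target["flow"])

    return sum(losses.values()), losses
```

Because every term is computed from unlabeled or off-the-shelf signals, no 3D bounding boxes or panoptic labels are required, which is what allows the decomposition to scale beyond curated benchmarks.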