Inferring a meaningful geometric scene representation from a single image is a fundamental problem in computer vision. Approaches based on traditional depth map prediction can only reason about areas that are visible in the image. Neural radiance fields (NeRFs), in contrast, can capture true 3D including color, but are too complex to be generated from a single image. As an alternative, we propose to predict an implicit density field from a single image. A density field maps every location in the frustum of the input image to volumetric density. By directly sampling color from the available views instead of storing it in the density field, our scene representation becomes significantly less complex than NeRFs, and a neural network can predict it in a single forward pass. The prediction network is trained through self-supervision from video data alone. Our formulation allows volume rendering to perform both depth prediction and novel view synthesis. Through experiments, we show that our method is able to predict meaningful geometry for regions that are occluded in the input image. Additionally, we demonstrate the potential of our approach on three datasets for depth prediction and novel view synthesis.
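To make the formulation more concrete, below is a minimal sketch (not the authors' implementation) of how volume rendering over a density-only field can yield both a pixel color and a depth value for one ray. The names `sigma_fn` and `project_fn` are hypothetical stand-ins for the predicted density field and the camera projection into the input view, and the quadrature follows standard NeRF-style alpha compositing.

```python
import numpy as np

def render_ray(origin, direction, sigma_fn, image, project_fn,
               t_near=0.5, t_far=50.0, n_samples=64):
    """Composite color and expected depth along a single ray.

    sigma_fn:   hypothetical callable mapping 3D points (N, 3) -> densities (N,)
    image:      input view as an (H, W, 3) uint8 array; colors are looked up here
    project_fn: hypothetical callable mapping 3D points (N, 3) -> pixel coords (N, 2)
    """
    t = np.linspace(t_near, t_far, n_samples)                  # sample depths along the ray
    pts = origin[None, :] + t[:, None] * direction[None, :]    # (N, 3) sample locations

    sigma = sigma_fn(pts)                                      # predicted volumetric density
    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))         # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)                       # opacity per interval
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = trans * alpha                                     # contribution of each sample

    # Color is not stored in the field: re-project every sample point into the
    # input image and sample its color there (nearest-neighbor for brevity).
    uv = project_fn(pts).round().astype(int)
    uv[:, 0] = np.clip(uv[:, 0], 0, image.shape[1] - 1)
    uv[:, 1] = np.clip(uv[:, 1], 0, image.shape[0] - 1)
    colors = image[uv[:, 1], uv[:, 0]].astype(np.float32) / 255.0

    rgb = (weights[:, None] * colors).sum(axis=0)               # rendered pixel color
    depth = (weights * t).sum()                                  # expected ray termination = depth
    return rgb, depth
```

Because the field stores only density, the same weights serve double duty: compositing re-projected colors gives novel view synthesis, while the expected termination distance gives a depth estimate.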