Inferring a meaningful geometric scene representation from a single image is a fundamental problem in computer vision. Approaches based on traditional depth map prediction can only reason about areas that are visible in the image. Currently, neural radiance fields (NeRFs) can capture true 3D including color, but are too complex to be generated from a single image. As an alternative, we introduce a neural network that predicts an implicit density field from a single image. It maps every location in the frustum of the image to volumetric density. Our network can be trained through self-supervision from only video data. By not storing color in the implicit volume but directly sampling it from the available views during training, our scene representation becomes significantly less complex than NeRFs, and we can train neural networks to predict it. Thus, we can apply volume rendering to perform both depth prediction and novel view synthesis. In our experiments, we show that our method is able to predict meaningful geometry for regions that are occluded in the input image. Additionally, we demonstrate the potential of our approach on three datasets for depth prediction and novel view synthesis.
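As a rough illustration of the rendering step described above, the sketch below composites color and expected depth along a single camera ray from a predicted density field, with color obtained by sampling the available views rather than from the field itself. This is a minimal sketch, not the authors' implementation; the callables `predict_density` and `sample_color_from_view`, as well as all parameter values, are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): volume rendering along one ray using
# a density-only field, with color reprojected from an available source view
# instead of being stored in the implicit volume. Function names are assumptions.
import numpy as np

def render_ray(origin, direction, predict_density, sample_color_from_view,
               near=0.5, far=50.0, n_samples=64):
    """Composite color and expected depth along a single ray.

    origin, direction: (3,) ray origin and unit direction in world coordinates.
    predict_density: callable mapping (N, 3) points -> (N,) densities sigma.
    sample_color_from_view: callable mapping (N, 3) points -> (N, 3) RGB,
        e.g. by projecting the points into one of the available training views.
    """
    t = np.linspace(near, far, n_samples)                    # sample depths along the ray
    pts = origin[None, :] + t[:, None] * direction[None, :]  # (N, 3) sample points
    sigma = predict_density(pts)                              # (N,) volumetric density
    rgb = sample_color_from_view(pts)                         # (N, 3) color from a source view

    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))        # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)                      # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # accumulated transmittance
    weights = alpha * trans                                    # contribution of each sample

    color = (weights[:, None] * rgb).sum(axis=0)               # rendered pixel color
    depth = (weights * t).sum()                                 # expected ray termination depth
    return color, depth
```

Because only density is predicted, the same rendered color can be supervised against ground-truth pixels from the training video, and the weighted sample depths yield a depth estimate from the very same representation.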