Depth estimation is an important step in many computer vision problems, such as 3D reconstruction, novel view synthesis, and computational photography. Most existing work focuses on depth estimation from single frames; when applied to videos, the results lack temporal consistency, exhibiting flickering and swimming artifacts. In this paper, we aim to estimate temporally consistent depth maps of video streams in an online setting. This is a difficult problem because future frames are not available, and the method must choose between enforcing consistency and correcting errors from previous estimates. The presence of dynamic objects further complicates the problem. We propose to address these challenges with a global point cloud that is dynamically updated each frame, combined with a learned fusion approach in image space. Our approach encourages consistency while simultaneously allowing updates that handle errors and dynamic objects. Qualitative and quantitative results show that our method achieves state-of-the-art quality for consistent video depth estimation.
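To make the online loop described in the abstract concrete, here is a minimal sketch of the per-frame cycle: estimate depth for the new frame, render the global point cloud into the current view as a consistency prior, fuse the two in image space, and push the fused result back into the point cloud. This is not the authors' implementation: it assumes a pinhole camera with known intrinsics and poses, uses a placeholder single-frame depth estimator, and substitutes a fixed confidence-weighted blend for the paper's learned fusion network. All names (`estimate_depth`, `render_depth`, `backproject`) are illustrative.

```python
import numpy as np

H, W = 48, 64
K = np.array([[50.0, 0.0, W / 2],   # assumed pinhole intrinsics
              [0.0, 50.0, H / 2],
              [0.0, 0.0, 1.0]])

def estimate_depth(frame):
    """Placeholder for a single-frame depth network (hypothetical)."""
    return np.full((H, W), 2.0) + 0.05 * np.random.randn(H, W)

def backproject(depth, pose):
    """Lift a depth map into world-space points using K and the camera pose."""
    v, u = np.mgrid[0:H, 0:W]
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)]).reshape(3, -1)
    pts_cam = rays * depth.reshape(1, -1)
    R, t = pose
    return (R @ pts_cam + t[:, None]).T  # (N, 3) world points

def render_depth(points, pose):
    """Z-buffer the global point cloud into the current view."""
    R, t = pose
    pts_cam = (points - t) @ R          # world -> camera (R orthonormal)
    uvz = K @ pts_cam.T
    z = uvz[2]
    valid = z > 1e-6
    u = np.round(uvz[0, valid] / z[valid]).astype(int)
    v = np.round(uvz[1, valid] / z[valid]).astype(int)
    depth = np.full((H, W), np.inf)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    np.minimum.at(depth, (v[inside], u[inside]), z[valid][inside])
    return depth

cloud = np.empty((0, 3))
identity_pose = (np.eye(3), np.zeros(3))   # static camera, for brevity
for frame_idx in range(5):                 # stand-in for a video stream
    frame, pose = None, identity_pose
    current = estimate_depth(frame)
    prior = render_depth(cloud, pose)
    # Fuse in image space: trust the rendered prior where it exists,
    # otherwise the fresh estimate. The paper learns this fusion; here
    # it is a fixed blend purely for illustration.
    w = np.where(np.isfinite(prior), 0.8, 0.0)
    fused = w * np.where(np.isfinite(prior), prior, 0.0) + (1 - w) * current
    # Dynamically update the global point cloud with the fused result,
    # so later frames stay consistent while still absorbing corrections.
    cloud = backproject(fused, pose)
print(cloud.shape)
```

The point of the structure is the feedback path: because the fused depth, not the raw per-frame estimate, is written back into the point cloud, the prior rendered for the next frame already reflects both past consistency and the latest corrections, which is how the method can favor consistency yet still adapt to errors and dynamic objects.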