We present a novel method for multi-view depth estimation from a single video, a critical task in applications such as perception, reconstruction, and robot navigation. Although previous learning-based methods have demonstrated compelling results, most estimate the depth map of each video frame independently, without accounting for the strong geometric and temporal coherence among frames. Moreover, current state-of-the-art (SOTA) models mostly adopt fully 3D convolutional networks for cost regularization, which incurs a high computational cost and limits their deployment in real-world applications. Our method achieves temporally coherent depth estimation by using a novel Epipolar Spatio-Temporal (EST) transformer to explicitly model geometric and temporal correlations among multiple estimated depth maps. Furthermore, to reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network consisting of a 2D context-aware network and a 3D matching network, which learn 2D context information and 3D disparity cues separately. Extensive experiments demonstrate that our method achieves higher depth estimation accuracy and a significant speedup over SOTA methods.
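The cost volumes regularized by the 3D networks mentioned above are commonly built by plane-sweep stereo: the source view is warped to the reference view under a set of fronto-parallel depth hypotheses, and a per-pixel matching cost is recorded for each hypothesis. The following NumPy sketch is purely illustrative (it is not the paper's EST implementation); the function names, the absolute-difference cost, and the nearest-neighbour sampling are simplifying assumptions.

```python
import numpy as np

def plane_sweep_homographies(K_ref, K_src, R, t, depths):
    # Homography induced by the fronto-parallel plane at depth d:
    #   H(d) = K_src (R - t n^T / d) K_ref^{-1},  with n = [0, 0, 1]^T
    n = np.array([0.0, 0.0, 1.0])
    K_ref_inv = np.linalg.inv(K_ref)
    return np.stack([K_src @ (R - np.outer(t, n) / d) @ K_ref_inv
                     for d in depths])

def build_cost_volume(ref_img, src_img, Hs):
    # Warp the source image into the reference view for every depth
    # hypothesis and record a per-pixel matching cost (absolute
    # intensity difference with nearest-neighbour sampling -- toy
    # choices; learned features and soft sampling are used in practice).
    h, w = ref_img.shape
    volume = np.full((len(Hs), h, w), np.inf)
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # homogeneous
    for i, H in enumerate(Hs):
        warped = H @ pix
        u = np.round(warped[0] / warped[2]).astype(int)
        v = np.round(warped[1] / warped[2]).astype(int)
        valid = (0 <= u) & (u < w) & (0 <= v) & (v < h)
        cost = np.full(h * w, np.inf)
        cost[valid] = np.abs(ref_img.ravel()[valid]
                             - src_img[v[valid], u[valid]])
        volume[i] = cost.reshape(h, w)
    return volume
```

A depth map then follows by taking, at each pixel, the hypothesis with the lowest cost along the first axis (`depths[np.argmin(volume, axis=0)]`); cost regularization smooths the volume before this step.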