Single-view depth estimation from omnidirectional images has gained popularity with its wide range of applications such as autonomous driving and scene reconstruction. Although data-driven learning-based methods demonstrate significant potential in this field, scarce training data and ineffective 360 estimation algorithms remain two key limitations hindering accurate estimation across diverse domains. In this work, we first establish a large-scale dataset with varied settings, called Depth360, to tackle the training data problem. We achieve this by exploiting a plenteous source of data, 360 videos from the internet, with a test-time training method that leverages the unique information in each omnidirectional sequence. With novel geometric and temporal constraints, our method generates consistent and convincing depth samples that facilitate single-view estimation. We then propose SegFuse, an end-to-end two-branch multi-task learning network that mimics the human eye to learn effectively from the dataset and estimate high-quality depth maps from diverse monocular RGB images. With a peripheral branch that uses equirectangular projection for depth estimation and a foveal branch that uses cubemap projection for semantic segmentation, our method predicts consistent global depth while preserving sharp details in local regions. Experimental results show favorable performance against state-of-the-art methods.
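To make the two-branch, multi-task idea concrete, the sketch below shows one possible PyTorch layout in which an equirectangular (peripheral) branch predicts depth and a cubemap (foveal) branch predicts semantic labels, with the segmentation features folded back into the depth prediction. The layer widths, the number of segmentation classes, and the mock cubemap-to-equirectangular fusion (a simple average and resize) are illustrative assumptions and do not reproduce the actual SegFuse architecture.

```python
# Minimal two-branch multi-task sketch (illustrative only; not the SegFuse design).
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    # Simple conv -> BN -> ReLU unit shared by both branches.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Branch(nn.Module):
    """Small encoder-decoder; out_ch = 1 for depth, num_classes for segmentation."""
    def __init__(self, out_ch):
        super().__init__()
        self.enc1 = conv_block(3, 32)
        self.enc2 = conv_block(32, 64)
        self.dec = conv_block(64, 32)
        self.head = nn.Conv2d(32, out_ch, 1)

    def forward(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(F.max_pool2d(f1, 2))
        up = F.interpolate(f2, scale_factor=2, mode="bilinear", align_corners=False)
        feat = self.dec(up)
        return self.head(feat), feat

class TwoBranchNet(nn.Module):
    def __init__(self, num_classes=13):  # class count is an assumption
        super().__init__()
        # Peripheral branch: equirectangular RGB -> dense depth.
        self.depth_branch = Branch(out_ch=1)
        # Foveal branch: six cubemap faces (stacked along batch) -> semantic labels.
        self.seg_branch = Branch(out_ch=num_classes)
        # Naive fusion of depth and segmentation features on the equirectangular grid.
        self.fuse = nn.Conv2d(32 + 32, 1, 3, padding=1)

    def forward(self, equi_rgb, cube_rgb):
        depth, d_feat = self.depth_branch(equi_rgb)
        seg_logits, s_feat = self.seg_branch(cube_rgb)
        # Placeholder for cubemap-to-equirectangular re-projection:
        # average the six face features and resize to the equirectangular resolution.
        b = equi_rgb.shape[0]
        s_feat = s_feat.view(b, 6, *s_feat.shape[1:]).mean(dim=1)
        s_feat = F.interpolate(s_feat, size=d_feat.shape[-2:], mode="bilinear",
                               align_corners=False)
        refined_depth = depth + self.fuse(torch.cat([d_feat, s_feat], dim=1))
        return refined_depth, seg_logits

# One 256x512 equirectangular frame and its six 128x128 cube faces.
net = TwoBranchNet()
equi = torch.randn(1, 3, 256, 512)
cube = torch.randn(6, 3, 128, 128)   # faces stacked along the batch dimension
depth, seg = net(equi, cube)
print(depth.shape, seg.shape)        # (1, 1, 256, 512), (6, 13, 128, 128)
```

In this layout the two projections play complementary roles: the equirectangular branch sees the full field of view for globally consistent depth, while the cubemap branch works on distortion-reduced faces where fine semantic boundaries are easier to learn; how the cube-face features are re-projected and fused is the key design choice and is only mocked here.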