In this paper, we propose a learning-based method for predicting dense depth values of a scene from a monocular omnidirectional image. An omnidirectional image has a full field-of-view, providing much more complete descriptions of the scene than perspective images. However, fully-convolutional networks that most current solutions rely on fail to capture rich global contexts from the panorama. To address this issue and also the distortion of equirectangular projection in the panorama, we propose Cubemap Vision Transformers (CViT), a new transformer-based architecture that can model long-range dependencies and extract distortion-free global features from the panorama. We show that cubemap vision transformers have a global receptive field at every stage and can provide globally coherent predictions for spherical signals. To preserve important local features, we further design a convolution-based branch in our pipeline (dubbed GLPanoDepth) and fuse global features from cubemap vision transformers at multiple scales. This global-to-local strategy allows us to fully exploit useful global and local features in the panorama, achieving state-of-the-art performance in panoramic depth estimation.
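To make the global-to-local design concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of the two-branch idea in PyTorch: a small ViT encoder over cubemap-face patch tokens provides global, distortion-free context, a convolutional encoder over the equirectangular panorama provides local detail, and the two are fused before depth regression. All class names (`CubemapViTBranch`, `ConvBranch`, `GLPanoDepthSketch`), layer sizes, and the single-scale fusion are assumptions for illustration; the paper fuses at multiple scales and uses its own cubemap projection.

```python
# Hypothetical sketch of a global-to-local panoramic depth pipeline.
# Global branch: transformer over cubemap-face tokens (distortion-free context).
# Local branch: convolutions over the equirectangular image (fine detail).
import torch
import torch.nn as nn


class CubemapViTBranch(nn.Module):
    """Tokenize the 6 cubemap faces into patches and encode them with a small ViT."""

    def __init__(self, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, faces):                            # faces: (B, 6, 3, H, W)
        b = faces.shape[0]
        x = self.patch_embed(faces.flatten(0, 1))        # (B*6, dim, h', w')
        tokens = x.flatten(2).transpose(1, 2)            # (B*6, N, dim)
        tokens = tokens.reshape(b, -1, tokens.shape[-1]) # all faces share one sequence
        return self.encoder(tokens)                      # global features, (B, 6*N, dim)


class ConvBranch(nn.Module):
    """Plain convolutional encoder over the equirectangular panorama (local features)."""

    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, pano):                             # pano: (B, 3, H, W)
        return self.net(pano)                            # (B, dim, H/8, W/8)


class GLPanoDepthSketch(nn.Module):
    """Fuse global (cubemap-ViT) and local (conv) features, then regress depth."""

    def __init__(self, dim=256):
        super().__init__()
        self.global_branch = CubemapViTBranch(dim=dim)
        self.local_branch = ConvBranch(dim=dim)
        self.fuse = nn.Conv2d(dim * 2, dim, 1)
        self.head = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, 1, 1),
        )

    def forward(self, pano, faces):
        local_feat = self.local_branch(pano)             # (B, dim, h, w)
        g = self.global_branch(faces)                    # (B, N, dim)
        # Pool the global tokens and broadcast them over the local feature map;
        # the paper fuses at multiple scales, this sketch fuses once for brevity.
        g = g.mean(dim=1)[:, :, None, None].expand(-1, -1, *local_feat.shape[-2:])
        fused = self.fuse(torch.cat([local_feat, g], dim=1))
        depth = self.head(fused)
        return nn.functional.interpolate(
            depth, size=pano.shape[-2:], mode="bilinear", align_corners=False
        )


if __name__ == "__main__":
    pano = torch.randn(1, 3, 256, 512)      # equirectangular input
    faces = torch.randn(1, 6, 3, 128, 128)  # cubemap faces (projection step not shown)
    print(GLPanoDepthSketch()(pano, faces).shape)        # torch.Size([1, 1, 256, 512])
```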