Existing panoramic depth estimation methods based on convolutional neural networks (CNNs) focus on removing panoramic distortions, failing to perceive panoramic structures efficiently due to the fixed receptive field in CNNs. This paper proposes the panorama transformer (named PanoFormer) to estimate the depth of panoramic images, with tangent patches from the spherical domain, learnable token flows, and panorama-specific metrics. In particular, we divide patches on the spherical tangent domain into tokens to reduce the negative effect of panoramic distortions. Since geometric structures are essential for depth estimation, the self-attention module is redesigned with an additional learnable token flow. In addition, considering the characteristics of the spherical domain, we present two panorama-specific metrics to comprehensively evaluate the performance of panoramic depth estimation models. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art (SOTA) methods. Furthermore, the proposed method can be effectively extended to solve semantic panorama segmentation, a similar pixel2pixel task. Code will be available.
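To make the tangent-patch idea concrete, the sketch below (a minimal illustration, not the authors' implementation; the function name `tangent_patch_positions` and the `spacing` parameter are hypothetical) computes the spherical sampling positions of an s × s patch laid out on the tangent plane at a token's center, using the standard inverse gnomonic projection.

```python
# A minimal sketch (not the authors' code): spherical sampling positions for an
# s x s patch on the tangent plane at (lat0, lon0), via inverse gnomonic projection.
import numpy as np

def tangent_patch_positions(lat0, lon0, s=3, spacing=0.01):
    """Return an (s*s, 2) array of (lat, lon) sampling positions in radians.

    lat0, lon0 : tangent point in radians
    spacing    : grid step on the unit tangent plane (hypothetical parameter)
    """
    r = (s - 1) / 2.0
    ys, xs = np.meshgrid((np.arange(s) - r) * spacing,
                         (np.arange(s) - r) * spacing, indexing="ij")
    x, y = xs.ravel(), ys.ravel()
    rho = np.hypot(x, y)
    c = np.arctan(rho)                       # angular distance from tangent point
    sin_c, cos_c = np.sin(c), np.cos(c)
    safe_rho = np.where(rho == 0, 1.0, rho)  # avoid 0/0 at the patch center
    lat = np.arcsin(cos_c * np.sin(lat0) + y * sin_c * np.cos(lat0) / safe_rho)
    lon = lon0 + np.arctan2(x * sin_c,
                            safe_rho * np.cos(lat0) * cos_c - y * np.sin(lat0) * sin_c)
    return np.stack([lat, lon], axis=-1)
```

Mapping these (lat, lon) positions back to equirectangular (ERP) pixel coordinates yields the token's sampling locations; because the patch is defined on the tangent plane rather than the distorted ERP grid, tokens near the poles keep a roughly uniform spatial footprint.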
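Likewise, self-attention with a learnable token flow can be read as a deformable-attention-style module: each query predicts small 2D offsets ("flows") that shift its tangent sampling positions before features are gathered and weighted. The PyTorch sketch below is a minimal single-head approximation under that reading; the class `TokenFlowAttention` and its layer names are assumptions, not the paper's exact module.

```python
# A minimal sketch (assumed design, not the paper's module) of self-attention
# with a learnable token flow over s*s tangent sampling positions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenFlowAttention(nn.Module):
    def __init__(self, dim, num_samples=9):
        super().__init__()
        self.num_samples = num_samples
        self.flow = nn.Linear(dim, num_samples * 2)  # learnable token flow (dx, dy)
        self.attn = nn.Linear(dim, num_samples)      # per-sample attention logits
        self.value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feat, base_grid):
        # feat:      (B, C, H, W) feature map on the ERP grid
        # base_grid: (B, H, W, S, 2) tangent sampling positions, normalized to [-1, 1]
        B, C, H, W = feat.shape
        S = self.num_samples
        q = feat.permute(0, 2, 3, 1)                          # (B, H, W, C)
        flow = self.flow(q).view(B, H, W, S, 2)               # predicted offsets
        grid = (base_grid + flow).view(B, H, W * S, 2)        # shifted positions
        v = self.value(q).permute(0, 3, 1, 2)                 # (B, C, H, W)
        sampled = F.grid_sample(v, grid, align_corners=True)  # (B, C, H, W*S)
        sampled = sampled.view(B, C, H, W, S)
        w = F.softmax(self.attn(q), dim=-1).unsqueeze(1)      # (B, 1, H, W, S)
        out = (sampled * w).sum(dim=-1)                       # (B, C, H, W)
        return self.proj(out.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
```

As in deformable attention, both the sampling offsets and the attention weights are predicted directly from the query, so the effective receptive field adapts to the panoramic geometry instead of staying fixed as in a CNN.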