Because ground-truth depth is difficult to acquire for equirectangular (360) images, the equirectangular depth data available today is insufficient in both quality and quantity to represent the variety of scenes in the world. As a result, 360 depth estimation studies that rely solely on supervised learning are destined to produce unsatisfactory results. Although self-supervised learning methods focusing on equirectangular images (EIs) have been introduced, they often yield incorrect or non-unique solutions, causing unstable performance. In this paper, we propose 360 monocular depth estimation methods that address the limitations of previous studies. First, we introduce a self-supervised 360 depth learning method that uses only gravity-aligned videos, which has the potential to eliminate the need for depth data during training. Second, we propose a joint learning scheme that combines supervised and self-supervised learning, so that each compensates for the weakness of the other, leading to more accurate depth estimation. Third, we propose a non-local fusion block, which better retains the global information encoded by the vision transformer when reconstructing depth. With the proposed methods, we successfully apply the transformer to 360 depth estimation, which, to the best of our knowledge, has not been tried before. On several benchmarks, our approach achieves significant improvements over previous works and establishes a new state of the art.