Because ground-truth depth is difficult to acquire for equirectangular (360) images, the equirectangular depth data available today is insufficient in both quality and quantity to represent the variety of scenes in the world. As a result, 360 depth estimation studies that rely solely on supervised learning are destined to produce unsatisfactory results. Although self-supervised learning methods for equirectangular images (EIs) have been introduced, they often yield incorrect or non-unique solutions, causing unstable performance. In this paper, we propose 360 monocular depth estimation methods that address the limitations of previous studies. First, we introduce a self-supervised 360 depth learning method that uses only gravity-aligned videos, which has the potential to eliminate the need for depth data during training. Second, we propose a joint learning scheme that combines supervised and self-supervised learning. Each compensates for the weaknesses of the other, leading to more accurate depth estimation. Third, we propose a non-local fusion block, which retains the global information encoded by the vision transformer when reconstructing depth. With the proposed methods, we successfully apply the transformer to 360 depth estimation, which, to the best of our knowledge, has not been attempted before. On several benchmarks, our approach achieves significant improvements over previous works and establishes a new state of the art.