In this paper, we formulate a potentially valuable panoramic depth completion (PDC) task, motivated by the fact that panoramic 3D cameras often produce 360$^\circ$ depth maps with missing data in complex scenes. The goal of PDC is to recover dense panoramic depth from the raw sparse depth and the panoramic RGB image. To tackle the PDC task, we train a deep network that takes both the depth and the image as inputs for dense panoramic depth recovery. However, training such a network faces a challenging optimization problem over its parameters due to the non-convex objective function. To address this problem, we propose a simple yet effective approach termed M$^3$PT: multi-modal masked pre-training. Specifically, during pre-training we simultaneously cover up patches of the panoramic RGB image and the sparse depth with a shared random mask, and then reconstruct the sparse depth in the masked regions. To the best of our knowledge, this is the first work to demonstrate the effectiveness of masked pre-training in a multi-modal vision task, rather than the single-modal task addressed by masked autoencoders (MAE). Unlike MAE, where fine-tuning completely discards the decoder used in pre-training, there is no architectural difference between the pre-training and fine-tuning stages in our M$^3$PT; the two stages differ only in the density of the prediction, which potentially makes the transfer learning more convenient and effective. Extensive experiments verify the effectiveness of M$^3$PT on three panoramic datasets. Notably, it improves the state-of-the-art baselines by an average of 26.2% in RMSE, 51.7% in MRE, 49.7% in MAE, and 37.5% in RMSElog on the three benchmark datasets.
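For concreteness, the sketch below illustrates the shared-mask step described above: the same random patch mask is applied to both the RGB image and the sparse depth, so that the two modalities are covered up identically before reconstruction. It is a minimal PyTorch sketch under assumed tensor shapes; the patch size, mask ratio, and function name are illustrative choices, not details fixed by the abstract.

```python
import torch


def shared_random_mask(rgb, sparse_depth, patch_size=16, mask_ratio=0.75):
    """Cover up the SAME random patches in both modalities (hypothetical helper).

    rgb:          (B, 3, H, W) panoramic RGB image
    sparse_depth: (B, 1, H, W) raw sparse panoramic depth
    Returns the masked image, the masked depth, and the shared patch mask.
    """
    B, _, H, W = rgb.shape
    gh, gw = H // patch_size, W // patch_size
    n_patches = gh * gw

    # One shared per-sample mask: True = patch is covered up.
    n_masked = int(mask_ratio * n_patches)
    noise = torch.rand(B, n_patches, device=rgb.device)
    ids = noise.argsort(dim=1)[:, :n_masked]
    mask = torch.zeros(B, n_patches, dtype=torch.bool, device=rgb.device)
    mask.scatter_(1, ids, True)

    # Expand the patch-level mask to pixel resolution and zero out the masked
    # regions in BOTH modalities, so image and depth are masked identically.
    pix_mask = mask.view(B, 1, gh, gw).float()
    pix_mask = pix_mask.repeat_interleave(patch_size, dim=2)
    pix_mask = pix_mask.repeat_interleave(patch_size, dim=3)
    return rgb * (1 - pix_mask), sparse_depth * (1 - pix_mask), mask
```

During pre-training, the network would then be supervised to reconstruct the sparse depth inside the masked patches; at fine-tuning the mask is simply dropped and the same architecture predicts dense depth, consistent with the claim that the two stages differ only in prediction density.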