The recent development of \emph{foundation models} for monocular depth estimation, such as Depth Anything, has paved the way for zero-shot monocular depth estimation. Since such models return an affine-invariant disparity map, the favored technique for recovering metric depth is to fine-tune the model. However, this stage is not straightforward: it can be costly and time-consuming because of both the training itself and the creation of the dataset, which must contain images captured by the camera that will be used at test time, along with the corresponding ground truth. Moreover, fine-tuning may also degrade the generalization capacity of the original model. Instead, we propose in this paper a new method to rescale Depth Anything predictions using 3D points provided by sensors or techniques such as low-resolution LiDAR or structure-from-motion with poses given by an IMU. This approach avoids fine-tuning and preserves the generalization power of the original depth estimation model, while being robust to noise in the sparse depth measurements, in the camera-LiDAR calibration, and in the depth model itself. Our experiments highlight improvements over zero-shot monocular metric depth estimation methods, competitive results compared to fine-tuned approaches, and better robustness than depth completion approaches. Code is available at github.com/ENSTA-U2IS-AI/depth-rescaling.
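The abstract does not detail the rescaling estimator itself; the sketch below is one plausible minimal instantiation, assuming the affine relation between the predicted disparity and the metric disparity (the inverse of the sparse sensor depth) is recovered with a RANSAC loop followed by a least-squares refinement. The function name, inlier threshold, and iteration count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rescale_affine_disparity(pred_disp, sparse_depth, n_iters=200, thresh=0.05, seed=None):
    """Hypothetical sketch: recover metric depth from an affine-invariant
    disparity map using sparse metric depth samples (e.g. low-resolution
    LiDAR points projected into the image). Robustly fits
    metric_disparity = s * pred_disp + t, then inverts to metric depth.

    pred_disp    : (H, W) affine-invariant disparity prediction.
    sparse_depth : (H, W) metric depth in meters, 0 where no measurement exists.
    """
    rng = np.random.default_rng(seed)
    mask = sparse_depth > 0
    x = pred_disp[mask]           # predicted (relative) disparities
    y = 1.0 / sparse_depth[mask]  # metric disparities from the sensor

    # RANSAC over two-point hypotheses for (scale s, shift t),
    # giving robustness to outlying points and calibration errors.
    best_inliers, best_st = 0, (1.0, 0.0)
    for _ in range(n_iters):
        i, j = rng.choice(len(x), size=2, replace=False)
        if np.isclose(x[i], x[j]):
            continue
        s = (y[i] - y[j]) / (x[i] - x[j])
        t = y[i] - s * x[i]
        n_in = int((np.abs(s * x + t - y) < thresh).sum())
        if n_in > best_inliers:
            best_inliers, best_st = n_in, (s, t)

    # Refine (s, t) on the inlier set with ordinary least squares.
    s, t = best_st
    inliers = np.abs(s * x + t - y) < thresh
    A = np.stack([x[inliers], np.ones(int(inliers.sum()))], axis=1)
    s, t = np.linalg.lstsq(A, y[inliers], rcond=None)[0]

    metric_disp = np.clip(s * pred_disp + t, 1e-6, None)
    return 1.0 / metric_disp
```

Fitting in disparity space rather than depth space matches the affine-invariant output of Depth Anything, and the RANSAC stage is one way to obtain the robustness to sparse-depth noise and camera-LiDAR calibration errors that the abstract emphasizes.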