Besides image classification, Contrastive Language-Image Pre-training (CLIP) has achieved extraordinary success on a wide range of vision tasks, including object-level and 3D-space understanding. However, it remains challenging to transfer the semantic knowledge learned by CLIP to more intricate tasks with quantified targets, such as depth estimation, which requires geometric information. In this paper, we propose to apply CLIP to zero-shot monocular depth estimation, and name the method DepthCLIP. We find that the patches of an input image can respond to a certain semantic distance token and then be projected to a quantified depth bin for coarse estimation. Without any training, our DepthCLIP surpasses existing unsupervised methods and even approaches early fully supervised networks. To the best of our knowledge, we are the first to conduct zero-shot adaptation from semantic language knowledge to quantified downstream tasks and to perform zero-shot monocular depth estimation. We hope our work can shed light on future research. The code is available at https://github.com/Adonis-galaxy/DepthCLIP.
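A minimal sketch of the core idea follows, not the authors' exact implementation: each image patch responds to a set of semantic distance tokens, and a softmax over those responses weights depth-bin centers into a coarse per-patch depth. The specific prompt wording, bin values, and the placeholder patch features are illustrative assumptions; in practice the patch features would come from the CLIP visual encoder's dense outputs.

```python
# Sketch of CLIP-based zero-shot depth estimation (assumptions noted in comments).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical semantic distance tokens and the depth-bin centers (in meters)
# they project to; the actual prompt set and bin values may differ.
distance_words = ["giant", "extremely close", "close", "not in distance",
                  "a little remote", "far", "unseen"]
bin_centers = torch.tensor([1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0], device=device)

prompts = clip.tokenize([f"This object is {w}." for w in distance_words]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(prompts).float()                # (7, 512)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Placeholder patch features; in practice these would be the dense, patch-level
# outputs of the CLIP image encoder for one input image.
patch_feat = torch.randn(24 * 32, 512, device=device)            # (num_patches, 512)
patch_feat = patch_feat / patch_feat.norm(dim=-1, keepdim=True)

# Each patch responds to every distance token; softmax-weighted bin centers
# give a coarse depth value per patch, without any training.
logits = 100.0 * patch_feat @ text_feat.t()                       # (num_patches, 7)
weights = logits.softmax(dim=-1)
depth = weights @ bin_centers                                     # coarse per-patch depth
```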