While monocular depth estimation (MDE) is an important problem in computer vision, it is difficult due to the ambiguity that results from the compression of a 3D scene into only 2 dimensions. It is common practice in the field to treat it as simple image-to-image translation, without consideration for the semantics of the scene and the objects within it. In contrast, humans and animals have been shown to use higher-level information to solve MDE: prior knowledge of the nature of the objects in the scene, their positions and likely configurations relative to one another, and their apparent sizes have all been shown to help resolve this ambiguity. In this paper, we present a novel method to enhance MDE performance by encouraging use of known-useful information about the semantics of objects and inter-object relationships within a scene. Our novel ObjCAViT module sources world-knowledge from language models and learns inter-object relationships in the context of the MDE problem using transformer attention, incorporating apparent size information. Our method produces highly accurate depth maps, and we obtain competitive results on the NYUv2 and KITTI datasets. Our ablation experiments show that the use of language and cross-attention within the ObjCAViT module increases performance. Code is released at https://github.com/DylanAuty/ObjCAViT.
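To make the core idea concrete, the following is a minimal conceptual sketch of how language-derived object embeddings could be fused with image features via cross-attention, in the spirit of the ObjCAViT module. All module names, dimensions, and the PyTorch framing here are illustrative assumptions rather than the paper's actual implementation; the official code is at the repository linked above.

```python
import torch
import torch.nn as nn


class ObjectCrossAttention(nn.Module):
    """Illustrative sketch (not the paper's code): image features attend to
    language-model object embeddings, so depth prediction can exploit
    object semantics and inter-object relationships."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, image_tokens: torch.Tensor, object_tokens: torch.Tensor):
        # image_tokens: (B, N_img, dim) flattened visual features (queries)
        # object_tokens: (B, N_obj, dim) projected language-model encodings of
        #   detected object names, optionally combined with apparent-size cues
        attended, _ = self.cross_attn(
            query=image_tokens, key=object_tokens, value=object_tokens
        )
        x = self.norm(image_tokens + attended)
        return x + self.ffn(x)


if __name__ == "__main__":
    B, N_img, N_obj, dim = 2, 240, 12, 256
    module = ObjectCrossAttention(dim=dim)
    img_feats = torch.randn(B, N_img, dim)
    obj_feats = torch.randn(B, N_obj, dim)  # stand-in for language-model object embeddings
    fused = module(img_feats, obj_feats)
    print(fused.shape)  # torch.Size([2, 240, 256])
```

The fused features would then feed a standard depth decoder; the point of the sketch is only that cross-attention lets visual tokens query semantic, language-sourced descriptions of the objects in the scene.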