The advent of autonomous driving and advanced driver assistance systems necessitates continued advances in computer vision for 3D scene understanding. Self-supervised monocular depth estimation, which estimates per-pixel distance to objects from a single camera without ground-truth labels, is an important task in 3D scene understanding. However, existing methods for this task are limited to convolutional neural network (CNN) architectures. In contrast to CNNs, which use localized linear operations and lose feature resolution across layers, vision transformers process representations at a constant resolution with a global receptive field at every stage. While recent works have compared transformers against their CNN counterparts for tasks such as image classification, no study has investigated the impact of using transformers for self-supervised monocular depth estimation. Here, we first demonstrate how to adapt vision transformers for self-supervised monocular depth estimation. We then compare transformer-based and CNN-based architectures on the KITTI depth prediction benchmarks, as well as their robustness to natural corruptions and adversarial attacks, including when the camera intrinsics are unknown. Our study shows that the transformer-based architecture, though less run-time efficient, achieves comparable performance while being more robust and generalizable.
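To make the self-supervised setup concrete, the sketch below illustrates the standard photometric reprojection objective used in this line of work: a predicted depth map and relative camera pose are used to differentiably warp a source frame into the target view, and the reconstruction error supervises both networks. This is a minimal illustration, not the paper's implementation; the function names (warp_source_to_target, photometric_loss) are hypothetical, the intrinsics K are assumed known, and practical systems add SSIM, per-pixel minimum reprojection, and edge-aware smoothness terms.

import torch
import torch.nn.functional as F

def warp_source_to_target(src_img, depth, T_src_tgt, K):
    """Differentiably warp a source frame into the target view.

    src_img:   (B, 3, H, W) source frame
    depth:     (B, 1, H, W) predicted depth for the target frame
    T_src_tgt: (B, 4, 4) predicted pose mapping target camera coords to source coords
    K:         (B, 3, 3) camera intrinsics (assumed known in this sketch)
    """
    B, _, H, W = src_img.shape
    device = src_img.device

    # Pixel grid in homogeneous coordinates: (B, 3, H*W)
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Back-project pixels to 3D target camera coordinates, then move them into the source frame.
    cam_points = (torch.linalg.inv(K) @ pix) * depth.view(B, 1, -1)
    cam_points_h = torch.cat([cam_points, torch.ones(B, 1, H * W, device=device)], dim=1)
    src_points = (T_src_tgt @ cam_points_h)[:, :3]

    # Project into the source image plane and normalize to [-1, 1] for grid_sample.
    proj = K @ src_points
    uv = proj[:, :2] / (proj[:, 2:3] + 1e-7)
    u = uv[:, 0].view(B, H, W) / (W - 1) * 2 - 1
    v = uv[:, 1].view(B, H, W) / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1)

    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)

def photometric_loss(tgt_img, src_img, depth, T_src_tgt, K):
    # L1 photometric reprojection error between the target frame and the warped source frame.
    warped = warp_source_to_target(src_img, depth, T_src_tgt, K)
    return (warped - tgt_img).abs().mean()

Because the loss depends only on the input video frames, both the depth network (CNN- or transformer-based) and the pose network can be trained end to end without ground-truth depth, which is the property the comparison in this work relies on.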