Monocular depth estimation has been widely studied, and significant improvements in performance have been reported recently. However, most previous works are evaluated on a few benchmark datasets, such as the KITTI dataset, and none provides an in-depth analysis of the generalization performance of monocular depth estimation. In this paper, we thoroughly investigate various backbone networks (e.g., CNN and Transformer models) with respect to the generalization of monocular depth estimation. First, we evaluate state-of-the-art models on both in-distribution and out-of-distribution datasets, the latter never seen during network training. Then, we investigate the internal properties of the representations from the intermediate layers of CNN-/Transformer-based models using synthetic texture-shifted datasets. Through extensive experiments, we observe that Transformers exhibit a strong shape bias, whereas CNNs exhibit a strong texture bias. We also find that texture-biased models generalize worse for monocular depth estimation than shape-biased models. We demonstrate that similar patterns are observed in real-world driving datasets captured under diverse environments. Lastly, we conduct a dense ablation study with various backbone networks used in modern strategies. The experiments demonstrate that the intrinsic locality of CNNs and the self-attention of Transformers induce texture bias and shape bias, respectively.
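To make the texture-shift probe concrete, the following is a minimal sketch of how texture-only perturbations can be used to compare the robustness of depth models. It assumes a generic PyTorch depth model mapping a (1, 3, H, W) image tensor to a depth map; the helpers `texture_shifts` and `depth_consistency` are hypothetical illustrations, and the blur/jitter/noise perturbations are stand-ins for the paper's synthetic texture-shifted datasets, not its actual protocol.

import torch
import torchvision.transforms as T

def texture_shifts():
    """Texture-only perturbations that leave scene shape/layout intact."""
    return {
        "blur":   T.GaussianBlur(kernel_size=9, sigma=3.0),
        "jitter": T.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5),
        "noise":  lambda x: (x + 0.1 * torch.randn_like(x)).clamp(0.0, 1.0),
    }

@torch.no_grad()
def depth_consistency(model, image):
    """Mean absolute relative change in predicted depth under texture shifts.

    Since scene geometry is unchanged by these perturbations, a
    texture-biased model is expected to change its prediction more
    than a shape-biased one.
    """
    base = model(image)
    scores = {}
    for name, shift in texture_shifts().items():
        shifted = model(shift(image))
        scores[name] = ((shifted - base).abs() / base.clamp(min=1e-6)).mean().item()
    return scores

Under this setup, a higher consistency score indicates stronger sensitivity to texture, which by the paper's observations should correlate with worse out-of-distribution generalization.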