Self-supervised monocular depth estimation has been widely studied recently. Most of the work has focused on improving performance on benchmark datasets, such as KITTI, but has offered few experiments on generalization performance. In this paper, we investigate backbone networks (e.g., CNNs, Transformers, and CNN-Transformer hybrid models) for generalizable monocular depth estimation. We first evaluate state-of-the-art models on diverse public datasets, which are never seen during network training. Next, we investigate the effects of texture-biased and shape-biased representations using various texture-shifted datasets that we generate. We observe that Transformers exhibit a strong shape bias, whereas CNNs exhibit a strong texture bias. We also find that shape-biased models show better generalization for monocular depth estimation than texture-biased models. Based on these observations, we design a CNN-Transformer hybrid network with a multi-level adaptive feature fusion module, called MonoFormer. The design intuition behind MonoFormer is to increase shape bias by employing Transformers while compensating for their weak locality bias by adaptively fusing multi-level representations. Extensive experiments show that the proposed method achieves state-of-the-art performance on various public datasets and the best generalization ability among competing methods.
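The abstract does not specify how the multi-level adaptive feature fusion module is implemented. As a rough, hypothetical illustration of the general idea — weighting feature maps from different network levels with normalized (softmax) scores before summing them — the following NumPy sketch may help; all names and the scoring scheme are assumptions, not the paper's actual design (which would learn the scores end-to-end):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(features, level_scores):
    """Fuse same-shape feature maps from multiple levels.

    features:     list of L arrays, one per network level (assumed resized
                  to a common spatial resolution beforehand)
    level_scores: length-L array of raw importance scores; in a real model
                  these would be predicted/learned, here they are given
    """
    weights = softmax(level_scores)  # normalize scores to sum to 1
    return sum(w * f for w, f in zip(weights, features))

# toy example: three 4x4 "feature maps" from three levels
feats = [np.full((4, 4), v) for v in (1.0, 2.0, 3.0)]
fused = adaptive_fusion(feats, np.array([0.0, 0.0, 0.0]))
# equal scores reduce fusion to a plain average of the three maps
```

With equal scores the fusion degenerates to averaging; skewing the scores lets the model emphasize shallow (local, texture-rich) or deep (global, shape-rich) levels, which matches the abstract's stated intuition of balancing Transformer shape bias against weak locality.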