Self-supervised monocular depth estimation has been widely studied in recent years. Most prior work has focused on improving performance on benchmark datasets such as KITTI, but has offered few experiments on generalization performance. In this paper, we investigate backbone networks (e.g., CNNs, Transformers, and CNN-Transformer hybrid models) with respect to the generalization of monocular depth estimation. We first evaluate state-of-the-art models on diverse public datasets that are never seen during network training. Next, we investigate the effects of texture-biased and shape-biased representations using the various texture-shifted datasets that we generate. We observe that Transformers exhibit a strong shape bias, whereas CNNs exhibit a strong texture bias. We also find that shape-biased models show better generalization performance for monocular depth estimation than texture-biased models. Based on these observations, we design a novel CNN-Transformer hybrid network with a multi-level adaptive feature fusion module, called MonoFormer. The design intuition behind MonoFormer is to increase shape bias by employing Transformers while compensating for the weak locality bias of Transformers by adaptively fusing multi-level representations. Extensive experiments show that the proposed method achieves state-of-the-art performance on various public datasets. Our method also shows the best generalization ability among the competing methods.
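To illustrate the idea of adaptively fusing multi-level representations mentioned above, the following is a minimal, hypothetical sketch in PyTorch. It is not the authors' implementation; the module name, channel sizes, and the choice of softmax-weighted fusion over globally pooled level scores are illustrative assumptions.

```python
# Hypothetical sketch of adaptive multi-level feature fusion (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFeatureFusion(nn.Module):
    """Fuses feature maps from multiple network levels with learned per-level weights."""

    def __init__(self, in_channels, fused_channels=256):
        super().__init__()
        # Project each level to a common channel dimension.
        self.projections = nn.ModuleList(
            nn.Conv2d(c, fused_channels, kernel_size=1) for c in in_channels
        )
        # Predict one scalar score per level from globally pooled features.
        self.score_heads = nn.ModuleList(
            nn.Linear(fused_channels, 1) for _ in in_channels
        )

    def forward(self, features):
        # features: list of tensors [B, C_i, H_i, W_i] from different levels.
        target_size = features[0].shape[-2:]
        projected, scores = [], []
        for feat, proj, head in zip(features, self.projections, self.score_heads):
            x = proj(feat)
            x = F.interpolate(x, size=target_size, mode="bilinear", align_corners=False)
            projected.append(x)
            scores.append(head(x.mean(dim=(2, 3))))  # [B, 1] score for this level
        # Softmax over levels yields adaptive fusion weights.
        weights = torch.softmax(torch.cat(scores, dim=1), dim=1)  # [B, L]
        fused = sum(
            weights[:, i].view(-1, 1, 1, 1) * projected[i]
            for i in range(len(projected))
        )
        return fused

# Example usage with three hypothetical feature levels.
feats = [
    torch.randn(2, 96, 40, 128),
    torch.randn(2, 192, 20, 64),
    torch.randn(2, 384, 10, 32),
]
fusion = AdaptiveFeatureFusion(in_channels=[96, 192, 384])
out = fusion(feats)  # [2, 256, 40, 128]
```

The design choice sketched here lets the network emphasize coarse, shape-oriented Transformer features or fine, local features on a per-sample basis, which is one plausible way to realize the stated goal of balancing shape bias and locality.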