Self-supervised monocular depth estimation, which requires no ground-truth depth for training, has attracted increasing attention in recent years. Designing lightweight yet effective models is of high interest, so that they can be deployed on edge devices. Many existing architectures benefit from heavier backbones at the expense of model size. In this paper we achieve comparable results with a lightweight architecture. Specifically, we investigate an efficient combination of CNNs and Transformers, and design a hybrid architecture, Lite-Mono. We propose a Consecutive Dilated Convolutions (CDC) module and a Local-Global Features Interaction (LGFI) module. The former extracts rich multi-scale local features, while the latter uses the self-attention mechanism to encode long-range global information into the features. Experiments demonstrate that our full model outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.
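To illustrate the two building blocks named above, the following is a minimal, hypothetical sketch, not the paper's actual implementation: a 1-D stand-in for the dilated convolutions a CDC-style module stacks to enlarge the receptive field, and a bare single-head scaled dot-product self-attention of the kind an LGFI-style module uses to mix global context. The function names `dilated_conv1d` and `self_attention` are illustrative.

```python
import math

def dilated_conv1d(x, w, dilation):
    # 'same'-padded 1-D dilated convolution: a toy stand-in for the 2-D
    # dilated convolutions a CDC-style block would stack. Increasing the
    # dilation widens the receptive field without adding parameters.
    k = len(w)
    pad = dilation * (k - 1) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w[j] * xp[i + j * dilation] for j in range(k))
            for i in range(len(x))]

def self_attention(X):
    # Single-head scaled dot-product self-attention over token rows of X
    # (queries = keys = values = X): every output row is a softmax-weighted
    # mixture of ALL rows, i.e. long-range global information.
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, krow)) / math.sqrt(d)
                  for krow in X]
        m = max(scores)
        e = [math.exp(s - m) for s in scores]
        z = sum(e)
        weights = [ei / z for ei in e]
        out.append([sum(wi * vrow[j] for wi, vrow in zip(weights, X))
                    for j in range(d)])
    return out
```

For example, convolving a unit impulse with a 3-tap kernel at dilation 2 spreads the response over a span of 5 samples, showing how consecutive dilations grow the effective receptive field.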