Self-supervised monocular depth estimation, which does not require ground-truth depth for training, has attracted attention in recent years. It is of high interest to design lightweight yet effective models so that they can be deployed on edge devices. Many existing architectures benefit from using heavier backbones at the expense of model size. This paper achieves comparable results with a lightweight architecture. Specifically, the efficient combination of CNNs and Transformers is investigated, and a hybrid architecture called Lite-Mono is presented. A Consecutive Dilated Convolutions (CDC) module and a Local-Global Features Interaction (LGFI) module are proposed. The former extracts rich multi-scale local features, and the latter takes advantage of the self-attention mechanism to encode long-range global information into the features. Experiments demonstrate that Lite-Mono outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.
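To make the two named modules concrete, the following is a minimal PyTorch sketch of the ideas the abstract describes: stacked dilated convolutions for multi-scale local features, and spatial self-attention for long-range global context. The class names, channel sizes, dilation rates, and head counts are illustrative assumptions, not the official Lite-Mono implementation.

```python
# Illustrative sketch only; all hyperparameters are assumptions,
# not the paper's official configuration.
import torch
import torch.nn as nn

class ConsecutiveDilatedConvs(nn.Module):
    """CDC-style block: consecutive 3x3 convolutions with growing dilation
    rates enlarge the receptive field and gather multi-scale local features."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.GELU(),
            )
            for d in dilations
        )

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(x)  # residual connection keeps training stable
        return x

class LocalGlobalFeaturesInteraction(nn.Module):
    """LGFI-style block: multi-head self-attention over flattened spatial
    positions injects long-range global context into the local features."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        attn_out = attn_out.transpose(1, 2).reshape(b, c, h, w)
        return x + attn_out                                # residual connection

# Quick shape check on a dummy feature map.
feat = torch.randn(1, 48, 32, 32)
feat = ConsecutiveDilatedConvs(48)(feat)
feat = LocalGlobalFeaturesInteraction(48)(feat)
print(feat.shape)  # torch.Size([1, 48, 32, 32])
```

The design intuition follows the abstract: convolutions alone capture only local neighborhoods, so the attention block is placed after the dilated-convolution stage to let every spatial position attend to all others at low added parameter cost.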