Exploiting mutually beneficial cross-domain information has shown great potential for accurate self-supervised depth estimation. In this work, we revisit feature fusion between depth and semantic information and propose an efficient local adaptive attention method for geometry-aware representation enhancement. Instead of building global connections or deforming attention across the feature space without restraint, we bound the spatial interaction within a learnable region of interest. In particular, we leverage geometric cues from semantic information to learn local adaptive bounding boxes that guide unsupervised feature aggregation. These local regions exclude most irrelevant reference points from the attention space, yielding more selective feature learning and faster convergence. We naturally extend the paradigm to a multi-head, hierarchical design that distills information at different semantic levels and improves the discriminative ability of the features for fine-grained depth estimation. Extensive experiments on the KITTI dataset show that the proposed method establishes a new state of the art on the self-supervised monocular depth estimation task, demonstrating its effectiveness over prior Transformer variants.
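The idea of bounding attention within a region of interest can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the normalized `(cx, cy, bw, bh)` box format, and the hard binary mask are assumptions made for illustration, and the box is passed in as a fixed argument, whereas the method described above learns it from semantic features. The sketch only shows how reference points outside the box drop out of the softmax.

```python
import numpy as np

def box_mask(h, w, box):
    """Boolean (h, w) mask, True for pixel centres inside the box.

    box = (cx, cy, bw, bh) in normalized [0, 1] coordinates
    (hypothetical format chosen for this sketch).
    """
    cx, cy, bw, bh = box
    ys = (np.arange(h) + 0.5) / h  # normalized row centres
    xs = (np.arange(w) + 0.5) / w  # normalized column centres
    in_y = np.abs(ys - cy) <= bh / 2
    in_x = np.abs(xs - cx) <= bw / 2
    return in_y[:, None] & in_x[None, :]

def local_box_attention(query, keys, values, box):
    """Scaled dot-product attention restricted to a bounding box.

    query: (d,) e.g. a depth feature at one location (assumed role)
    keys, values: (h, w, d) e.g. semantic feature maps (assumed role)
    Returns the aggregated (d,) feature and the (h, w) attention weights.
    """
    h, w, d = keys.shape
    scores = keys.reshape(-1, d) @ query / np.sqrt(d)
    mask = box_mask(h, w, box).reshape(-1)
    if not mask.any():
        raise ValueError("box covers no pixel centres")
    # Positions outside the box get -inf, so their softmax weight is exactly 0.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores[mask].max()  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    out = weights @ values.reshape(-1, d)
    return out, weights.reshape(h, w)

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 8, 4))
values = rng.normal(size=(8, 8, 4))
query = rng.normal(size=4)
out, attn = local_box_attention(query, keys, values, (0.5, 0.5, 0.5, 0.5))
```

A hard mask like this is not differentiable with respect to the box parameters, so a learnable version would need a different formulation (for example, sampling reference points inside the predicted box); the abstract does not specify which one is used.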