Multi-frame depth estimation generally achieves high accuracy by relying on multi-view geometric consistency. When applied in dynamic scenes, e.g., autonomous driving, this consistency is usually violated in dynamic areas, leading to corrupted estimations. Many multi-frame methods handle dynamic areas by identifying them with explicit masks and compensating for the multi-view cues with monocular cues, represented as local monocular depth or features. The improvements are limited because the quality of the masks is uncontrolled and the benefits of fusing the two types of cues are underutilized. In this paper, we propose a novel method that learns to fuse the multi-view and monocular cues, encoded as volumes, without requiring heuristically crafted masks. As our analyses reveal, multi-view cues capture more accurate geometric information in static areas, while monocular cues capture more useful context in dynamic areas. To propagate the geometric perception learned from multi-view cues in static areas to the monocular representation in dynamic areas, and to let monocular cues enhance the representation of the multi-view cost volume, we propose a cross-cue fusion (CCF) module, which uses cross-cue attention (CCA) to encode spatially non-local relative relations within each source to enhance the representation of the other. Experiments on real-world datasets demonstrate the significant effectiveness and generalization ability of the proposed method.
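To make the fusion idea concrete, the following is a minimal sketch of a bidirectional cross-cue attention block: non-local attention weights are computed within one cue's volume and applied to the other cue's volume, so each representation is enhanced by the relations learned from the other source. All module names, shapes, and the residual/concatenation fusion are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of cross-cue attention (CCA) and cross-cue fusion (CCF).
# Assumed layout: each cue is a 2D feature volume of shape (B, C, H, W).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossCueAttention(nn.Module):
    """Compute non-local relations within `source` and apply them to `target`."""

    def __init__(self, channels: int):
        super().__init__()
        # Query/key come from the source cue; value comes from the target cue.
        self.query = nn.Conv2d(channels, channels, 1)
        self.key = nn.Conv2d(channels, channels, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        b, c, h, w = source.shape
        q = self.query(source).flatten(2)   # (B, C, HW)
        k = self.key(source).flatten(2)     # (B, C, HW)
        v = self.value(target).flatten(2)   # (B, C, HW)
        # Spatially non-local relations computed from the source cue alone.
        attn = F.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)  # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        # Residual connection keeps the original target representation.
        return target + out


class CrossCueFusion(nn.Module):
    """Enhance each cue with the other's relations, then merge the two."""

    def __init__(self, channels: int):
        super().__init__()
        self.mono_to_multi = CrossCueAttention(channels)
        self.multi_to_mono = CrossCueAttention(channels)
        self.merge = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, multi_view: torch.Tensor, mono: torch.Tensor) -> torch.Tensor:
        # Monocular relations enhance the multi-view cost volume, and vice versa.
        enhanced_multi = self.mono_to_multi(mono, multi_view)
        enhanced_mono = self.multi_to_mono(multi_view, mono)
        return self.merge(torch.cat([enhanced_multi, enhanced_mono], dim=1))
```

The key design point this sketch illustrates is that the attention map for each branch is built from the *other* cue, so geometric confidence learned in static areas of the multi-view volume can reshape the monocular representation in dynamic areas without any explicit motion mask.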