Multi-frame methods improve monocular depth estimation over single-frame approaches by aggregating spatial-temporal information via feature matching. However, spatial-temporal feature matching degrades accuracy in dynamic scenes, where moving objects violate the static-scene assumption behind cross-frame correspondence. To recover this lost accuracy, recent methods tend to introduce complex architectures for feature matching and for handling dynamic scenes. In this paper, we show that a simple learning framework, together with carefully designed feature augmentation, leads to superior performance. (1) A novel dynamic-object detection method with geometric explainability is proposed. The detected dynamic objects are excluded during training, which enforces the static-environment assumption and alleviates the accuracy degradation of multi-frame depth estimation. (2) Multi-scale feature fusion is proposed for feature matching in the multi-frame depth network, improving matching quality, especially between frames with large camera motion. (3) Robust knowledge distillation, with a robust teacher network and a reliability guarantee, is proposed to improve multi-frame depth estimation without increasing computational complexity at test time. Experiments show that the proposed methods achieve significant performance improvements in multi-frame depth estimation.
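To make the three training-time components concrete, here is a minimal PyTorch sketch of how they might be wired up. It assumes per-pixel dynamic-object masks and a per-pixel teacher reliability map are available; the function names, the L1 penalties, and the bilinear upsample-and-concatenate fusion are hypothetical placeholders for illustration, not the paper's exact losses or architecture.

```python
import torch
import torch.nn.functional as F

def masked_photometric_loss(pred, target, dynamic_mask):
    # Self-supervised photometric term with detected dynamic-object pixels
    # excluded, so the static-environment assumption holds during training.
    # dynamic_mask: (B,1,H,W), 1 marks a detected dynamic-object pixel.
    valid = 1.0 - dynamic_mask
    diff = (pred - target).abs().mean(dim=1, keepdim=True)
    return (diff * valid).sum() / valid.sum().clamp(min=1e-6)

def fuse_multi_scale(features):
    # Multi-scale feature fusion (assumed form): upsample coarser maps to the
    # finest resolution and concatenate, so the matching stage sees features
    # that stay discriminative under large camera motion.
    size = features[0].shape[-2:]
    ups = [f if f.shape[-2:] == size
           else F.interpolate(f, size=size, mode="bilinear", align_corners=False)
           for f in features]
    return torch.cat(ups, dim=1)

def reliability_distillation_loss(student_depth, teacher_depth, reliability):
    # Distill the teacher's depth into the multi-frame student only where a
    # per-pixel reliability weight in [0,1] says the teacher can be trusted.
    diff = (student_depth - teacher_depth).abs()
    return (diff * reliability).sum() / reliability.sum().clamp(min=1e-6)

# Toy shapes only; real inputs come from the depth and teacher networks.
b, h, w = 2, 64, 80
photo = masked_photometric_loss(torch.rand(b, 3, h, w), torch.rand(b, 3, h, w),
                                (torch.rand(b, 1, h, w) > 0.9).float())
fused = fuse_multi_scale([torch.rand(b, 32, h, w),
                          torch.rand(b, 64, h // 2, w // 2)])
distill = reliability_distillation_loss(torch.rand(b, 1, h, w),
                                        torch.rand(b, 1, h, w),
                                        torch.rand(b, 1, h, w))
print(photo.item(), fused.shape, distill.item())
```

Note that both loss terms act only at training time, which is consistent with the claim that test-time computational complexity is unchanged: the masks, the teacher, and the reliability weights are discarded after training.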