Learning-based visual odometry (VO) algorithms achieve remarkable performance on common static scenes, benefiting from high-capacity models and massive annotated data, but tend to fail in dynamic, populated environments. Semantic segmentation is widely used to discard dynamic associations before estimating camera motion, but it comes at the cost of discarding static features and is hard to scale up to unseen categories. In this paper, we leverage the mutual dependence between camera ego-motion and motion segmentation and show that both can be jointly refined in a single learning-based framework. In particular, we present DytanVO, the first supervised learning-based VO method that deals with dynamic environments. It takes two consecutive monocular frames in real time and predicts camera ego-motion in an iterative fashion. Our method achieves an average improvement of 27.7% in ATE over state-of-the-art VO solutions in real-world dynamic environments, and even performs competitively among dynamic visual SLAM systems that optimize the trajectory on the backend. Experiments on a wide range of unseen environments also demonstrate our method's generalizability.
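To make the iterative refinement described above concrete, the sketch below illustrates one plausible form of the loop: ego-motion is estimated from pixels currently labeled static, the motion-segmentation mask is then re-estimated given that ego-motion, and the two are alternated until the pose converges. This is a minimal illustration, not the authors' implementation; `pose_net`, `seg_net`, `max_iters`, and `tol` are hypothetical placeholders.

```python
import torch

def estimate_ego_motion(pose_net, seg_net, frame_t, frame_t1,
                        max_iters=3, tol=1e-3):
    """Jointly refine camera ego-motion and a dynamic-object mask
    from two consecutive monocular frames (illustrative sketch only)."""
    # Start with an empty mask: every pixel is assumed static at first.
    mask = torch.zeros(frame_t.shape[-2:], dtype=torch.bool)
    prev_pose = None
    for _ in range(max_iters):
        # 1) Estimate ego-motion using only pixels currently labeled static.
        pose = pose_net(frame_t, frame_t1, static_mask=~mask)
        # 2) Re-segment moving regions given the updated ego-motion, so that
        #    image motion inconsistent with the camera motion is attributed
        #    to dynamic objects rather than to the camera.
        mask = seg_net(frame_t, frame_t1, ego_motion=pose)
        # 3) Stop once the pose update is small enough.
        if prev_pose is not None and torch.norm(pose - prev_pose) < tol:
            break
        prev_pose = pose
    return pose, mask
```

The design choice reflected here is the mutual dependence stated in the abstract: a better ego-motion estimate yields a cleaner motion segmentation, and a cleaner segmentation in turn removes dynamic correspondences that would otherwise bias the ego-motion estimate.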