Visual SLAM -- Simultaneous Localization and Mapping -- in dynamic environments typically relies on identifying and masking image features on moving objects to prevent them from degrading performance. Current approaches are suboptimal: they either fail to mask objects when needed or, on the contrary, mask objects needlessly. We therefore propose a novel SLAM that learns when masking objects improves its performance in dynamic scenarios. Given a method to segment objects and a SLAM, we endow the latter with Temporal Masking, i.e., the ability to infer when certain classes of objects should be masked to maximize any given SLAM metric. We impose no motion priors: our method learns on its own to mask moving objects. To avoid high annotation costs, we devised an automatic annotation method for self-supervised training. We also constructed a new dataset, named ConsInv, which includes challenging real-world dynamic sequences both indoors and outdoors. Our method reaches the state of the art on the TUM RGB-D dataset and outperforms it on the KITTI and ConsInv datasets.
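To make the masking step concrete, the sketch below shows one plausible way a SLAM front end could apply per-class masking decisions to extracted keypoints. This is a minimal illustration, not the paper's actual pipeline: the function `mask_features`, its arguments, and the dictionary-based interface are hypothetical, standing in for whatever the segmentation method and the learned temporal-masking policy actually produce.

```python
import numpy as np

def mask_features(keypoints, seg_masks, mask_decisions):
    """Drop keypoints that fall on objects of classes flagged for masking.

    keypoints:      (N, 2) array of (x, y) pixel coordinates
    seg_masks:      dict mapping class name -> boolean (H, W) segmentation mask
    mask_decisions: dict mapping class name -> bool; stands in for the output
                    of a learned temporal-masking policy for the current frame
    """
    keep = np.ones(len(keypoints), dtype=bool)
    xs = keypoints[:, 0].astype(int)
    ys = keypoints[:, 1].astype(int)
    for cls, should_mask in mask_decisions.items():
        if not should_mask or cls not in seg_masks:
            continue
        # Discard keypoints lying inside this class's segmentation mask.
        keep &= ~seg_masks[cls][ys, xs]
    return keypoints[keep]

if __name__ == "__main__":
    h, w = 480, 640
    car_mask = np.zeros((h, w), dtype=bool)
    car_mask[100:200, 300:400] = True        # hypothetical "car" region
    kps = np.array([[320, 150], [50, 50]])   # one keypoint on the car, one off
    kept = mask_features(kps, {"car": car_mask}, {"car": True})
    print(kept)                              # -> [[50 50]]
```

Under this reading, the contribution lies in learning `mask_decisions` per class and per time window from the SLAM metric itself, rather than hard-coding which classes are dynamic.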