Visual SLAM (Simultaneous Localization and Mapping) in dynamic environments typically relies on identifying and masking image features on moving objects to prevent them from degrading performance. Current approaches are suboptimal: they either fail to mask objects when needed or, conversely, mask objects needlessly. We therefore propose a novel SLAM that learns when masking objects improves its performance in dynamic scenarios. Given a method to segment objects and a SLAM, we give the latter the ability of Temporal Masking, i.e., to infer when certain classes of objects should be masked to maximize any given SLAM metric. We impose no prior on motion: our method learns by itself to mask moving objects. To avoid high annotation costs, we created an automatic annotation method for self-supervised training. We also constructed a new dataset, named ConsInv, which includes challenging real-world dynamic sequences both indoors and outdoors. Our method matches the state of the art on the TUM RGB-D dataset and outperforms it on the KITTI and ConsInv datasets.
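To make the masking idea concrete, below is a minimal illustrative sketch of how features falling on selected object classes could be filtered out before reaching a SLAM front end. This is not the paper's implementation; the function and variable names are hypothetical, and the learned decision of which classes to mask is represented simply as a set of class ids.

```python
import numpy as np

def mask_features(keypoints: np.ndarray,
                  segmentation: np.ndarray,
                  classes_to_mask: set) -> np.ndarray:
    """Keep only keypoints whose pixel does not belong to a masked class.

    keypoints:       (N, 2) array of integer (x, y) pixel coordinates.
    segmentation:    (H, W) array of per-pixel class ids for the current frame.
    classes_to_mask: class ids that the (learned) policy decided to mask.
    """
    xs, ys = keypoints[:, 0], keypoints[:, 1]
    labels = segmentation[ys, xs]                    # class id under each keypoint
    keep = ~np.isin(labels, list(classes_to_mask))   # True where the class is kept
    return keypoints[keep]

# Toy usage: a 4x4 frame where class 1 (say, "car") occupies the right half.
seg = np.zeros((4, 4), dtype=int)
seg[:, 2:] = 1
kps = np.array([[0, 0], [3, 1], [1, 2], [2, 3]])     # (x, y) keypoints
print(mask_features(kps, seg, classes_to_mask={1}))  # keypoints on class 1 removed
```

In this sketch, the surviving keypoints would be handed to the SLAM tracking stage unchanged; the contribution described in the abstract lies in learning, per sequence, which classes belong in `classes_to_mask` rather than masking by a fixed motion prior.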