Visual SLAM systems targeting static scenes have achieved satisfactory accuracy and robustness. Dynamic 3D object tracking has therefore become a significant capability in visual SLAM, driven by the need to understand dynamic surroundings in scenarios such as autonomous driving and augmented and virtual reality. However, performing dynamic SLAM with monocular images alone remains a challenging problem due to the difficulty of associating dynamic features and estimating their positions. In this paper, we present MOTSLAM, a dynamic visual SLAM system with a monocular configuration that tracks both the poses and the bounding boxes of dynamic objects. MOTSLAM first performs multiple object tracking (MOT) with associated 2D and 3D bounding box detection to create initial 3D objects. Then, neural-network-based monocular depth estimation is applied to obtain the depth of dynamic features. Finally, camera poses, object poses, and both static and dynamic map points are jointly optimized using a novel bundle adjustment. Our experiments on the KITTI dataset demonstrate that our system achieves state-of-the-art performance in both camera ego-motion estimation and object tracking among monocular dynamic SLAM systems.
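The joint bundle adjustment mentioned above, which couples camera poses, object poses, and static as well as dynamic map points through reprojection errors, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the simple pinhole model, and the representation of dynamic points in an object-local frame transformed by a per-frame object pose are all assumptions for exposition.

```python
def project(fx, fy, cx, cy, p_cam):
    # Pinhole projection of a 3D point given in camera coordinates.
    X, Y, Z = p_cam
    return (fx * X / Z + cx, fy * Y / Z + cy)

def transform(pose, p):
    # Apply a rigid transform: pose = (R, t) with R a 3x3 nested tuple,
    # t a 3-tuple; returns R @ p + t.
    R, t = pose
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i]
                 for i in range(3))

def ba_residuals(world_to_cam, obj_to_world,
                 static_pts, dyn_pts_obj,
                 obs_static, obs_dyn, intrinsics):
    """Stack reprojection residuals for one frame (illustrative only).

    Static map points live in the world frame and are mapped directly
    into the camera; dynamic map points live in an object-local frame
    and are first moved to the world frame by the object pose, coupling
    camera pose, object pose, and both kinds of points in one cost.
    """
    fx, fy, cx, cy = intrinsics
    residuals = []
    for p, (u, v) in zip(static_pts, obs_static):
        pu, pv = project(fx, fy, cx, cy, transform(world_to_cam, p))
        residuals += [pu - u, pv - v]
    for p, (u, v) in zip(dyn_pts_obj, obs_dyn):
        p_world = transform(obj_to_world, p)  # object frame -> world
        pu, pv = project(fx, fy, cx, cy, transform(world_to_cam, p_world))
        residuals += [pu - u, pv - v]
    return residuals
```

In a full system these residuals would be fed to a nonlinear least-squares solver that updates all camera poses, object poses, and point positions jointly; the sketch only shows how the two kinds of points enter the same objective.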