Semantic-aware reconstruction is more advantageous than geometric-only reconstruction for future robotic and AR/VR applications because it represents not only where things are, but also what they are. Object-centric mapping aims to build an object-level reconstruction in which objects are separate, meaningful entities that convey both geometric and semantic information. In this paper, we present MOLTR, a solution to object-centric mapping using only monocular image sequences and camera poses. It localises, tracks, and reconstructs multiple objects in an online fashion as an RGB camera captures a video of its surroundings. Given a new RGB frame, MOLTR first applies a monocular 3D detector to localise objects of interest and to extract shape codes that represent the object shapes in a learned embedding space. Detections are then merged into existing objects in the map after data association. The motion state (i.e., kinematics and motion status) of each object is tracked by a multiple-model Bayesian filter, and the object shape is progressively refined by fusing multiple shape codes. We evaluate localisation, tracking, and reconstruction on benchmark datasets for indoor and outdoor scenes, and show superior performance over previous approaches.
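The shape-refinement step above can be illustrated with a minimal sketch: per-frame shape codes for an associated object are fused into a single object-level code in the embedding space. The confidence-weighted mean used here is an illustrative assumption, not the paper's actual fusion rule, and `code_dim` is hypothetical.

```python
import numpy as np

def fuse_shape_codes(codes, weights=None):
    """Fuse per-frame shape codes into one object-level code.

    codes   : sequence of shape codes, shape (n_frames, code_dim)
    weights : optional per-frame confidences; uniform if omitted

    A hypothetical fusion rule: a confidence-weighted mean in the
    learned embedding space, refined as more detections arrive.
    """
    codes = np.asarray(codes, dtype=float)
    if weights is None:
        weights = np.ones(len(codes))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # normalise confidences
    return weights @ codes              # weighted average code

# Example: two detections of the same object; the later, more
# confident detection pulls the fused code toward its estimate.
fused = fuse_shape_codes([[0.0, 0.0], [2.0, 2.0]], weights=[1.0, 3.0])
```

The fused code can then be decoded back to a mesh or SDF by whatever shape decoder produced the embedding, so the reconstruction improves progressively without re-running the detector on past frames.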