We propose a novel approach for joint 3D multi-object tracking and reconstruction from RGB-D sequences in indoor environments. To this end, we detect and reconstruct objects in each frame while predicting dense correspondence mappings into a normalized object space. We leverage these correspondences to inform a graph neural network that solves for the optimal, temporally consistent 7-DoF pose trajectories of all objects. The novelty of our method is two-fold: first, we propose a new graph-based approach for differentiable pose estimation over time to learn optimal pose trajectories; second, we present a joint formulation of reconstruction and pose estimation along the time axis for robust and geometrically consistent multi-object tracking. To validate our approach, we introduce a new synthetic dataset comprising 2381 unique indoor sequences with a total of 60k rendered RGB-D images for multi-object tracking, with moving objects and camera positions derived from the synthetic 3D-FRONT dataset. We demonstrate that our method improves the accumulated MOTA score across all test sequences by 24.8% over existing state-of-the-art methods. In several ablations on synthetic and real-world sequences, we show that our graph-based, fully end-to-end-learnable approach yields a significant boost in tracking performance.
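As a minimal illustration of the per-frame geometry behind this pipeline, the sketch below shows one plausible way to recover a 7-DoF pose (isotropic scale, rotation, translation) from predicted dense correspondences between normalized object coordinates and camera-space points, using a differentiable weighted Umeyama alignment. This is not the paper's actual network or its graph-based temporal solver; the function name, tensor shapes, and weighting scheme are illustrative assumptions.

```python
# Sketch (assumption, not the authors' implementation): differentiable 7-DoF
# alignment of predicted normalized-object-space points to camera-space points.
import torch

def umeyama_7dof(nocs, points, weights):
    """nocs, points: (N, 3) correspondences; weights: (N,) non-negative confidences."""
    w = weights / weights.sum().clamp(min=1e-8)            # normalized per-point weights
    mu_n = (w[:, None] * nocs).sum(dim=0)                  # weighted centroid (object space)
    mu_p = (w[:, None] * points).sum(dim=0)                # weighted centroid (camera space)
    n_c, p_c = nocs - mu_n, points - mu_p                  # centered coordinates
    cov = (w[:, None] * p_c).T @ n_c                       # 3x3 weighted cross-covariance
    U, S, Vt = torch.linalg.svd(cov)
    d = torch.sign(torch.det(U @ Vt))                      # guard against reflections
    sign = torch.stack([torch.ones_like(d), torch.ones_like(d), d])
    R = U @ torch.diag(sign) @ Vt                          # rotation (3x3)
    var_n = (w * (n_c ** 2).sum(dim=1)).sum()              # weighted source variance
    s = (S * sign).sum() / var_n.clamp(min=1e-8)           # isotropic scale
    t = mu_p - s * (R @ mu_n)                              # translation
    return s, R, t                                         # 7 DoF: 1 + 3 + 3
```

Because every step is composed of differentiable tensor operations, gradients can flow from a pose-level loss back into the correspondence predictions, which is the kind of end-to-end coupling between reconstruction and pose estimation the abstract refers to.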