Accurate and consistent 3D tracking from multiple cameras is a key component of a vision-based autonomous driving system. It involves modeling 3D dynamic objects in complex scenes across multiple cameras. This problem is inherently challenging due to depth estimation errors, visual occlusions, appearance ambiguity, etc. Moreover, objects are not consistently associated across time and cameras. To address these challenges, we propose an end-to-end \textbf{MU}lti-camera \textbf{TR}acking framework called MUTR3D. In contrast to prior works, MUTR3D does not explicitly rely on the spatial and appearance similarity of objects. Instead, our method introduces \textit{3D track queries} to model the spatially and appearance-coherent track of each object that appears in multiple cameras and multiple frames. We use camera transformations to link the 3D trackers with their observations in 2D images. Each tracker is further refined according to features obtained from the camera images. MUTR3D uses a set-to-set loss to measure the difference between the predicted tracking results and the ground truths, so it does not require any post-processing such as non-maximum suppression or bounding-box association. MUTR3D outperforms state-of-the-art methods by 5.3 AMOTA on the nuScenes dataset. Code is available at: \url{https://github.com/a1600012888/MUTR3D}.
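To make the 3D-to-2D linking concrete, the following is a minimal PyTorch sketch of how a 3D track query's reference point could be projected into every camera and used to sample image features. All names here (\texttt{sample\_image\_features}, the tensor layouts) are illustrative assumptions for exposition, not the released MUTR3D API.

\begin{verbatim}
import torch

def sample_image_features(ref_points, feat_maps, proj_mats, img_size):
    """Link 3D track queries to 2D observations (illustrative sketch).

    ref_points: (N, 3) 3D reference points, one per track query.
    feat_maps:  (C, D, H, W) per-camera feature maps from the backbone.
    proj_mats:  (C, 3, 4) camera projections (intrinsics @ extrinsics).
    img_size:   (width, height) of the input images.
    """
    N = ref_points.shape[0]
    # Homogeneous coordinates: (N, 4)
    pts_h = torch.cat([ref_points, ref_points.new_ones(N, 1)], dim=-1)
    # Project the reference points into every camera: (C, N, 3)
    cam_pts = torch.einsum('cij,nj->cni', proj_mats, pts_h)
    depth = cam_pts[..., 2:3].clamp(min=1e-5)
    uv = cam_pts[..., :2] / depth                      # pixel coordinates
    # Normalize to [-1, 1] as expected by grid_sample
    wh = uv.new_tensor(img_size)
    grid = (uv / wh) * 2 - 1                           # (C, N, 2)
    # A point is visible in a camera if it lands inside the image
    # and in front of the camera
    valid = (grid.abs() <= 1).all(-1) & (cam_pts[..., 2] > 0)
    # Bilinear sampling: (C, D, N, 1) -> (C, N, D)
    feats = torch.nn.functional.grid_sample(
        feat_maps, grid.unsqueeze(2), align_corners=False)
    feats = feats.squeeze(-1).permute(0, 2, 1)
    # Average features over the cameras that actually see each point
    feats = feats * valid.unsqueeze(-1)
    return feats.sum(0) / valid.sum(0).clamp(min=1).unsqueeze(-1)
\end{verbatim}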
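Similarly, below is a minimal sketch of a DETR-style set-to-set loss: predictions are matched one-to-one to ground truths with the Hungarian algorithm, which is why no NMS or box association is needed afterwards. The cost terms and function names are simplified assumptions; a full tracking loss would also keep previously matched track identities fixed across frames, which this sketch omits.

\begin{verbatim}
import torch
from scipy.optimize import linear_sum_assignment

def set_to_set_loss(pred_boxes, pred_logits, gt_boxes, gt_labels):
    """Set-to-set loss sketch via bipartite (Hungarian) matching.

    pred_boxes:  (N, 7) predicted 3D boxes (x, y, z, w, l, h, yaw).
    pred_logits: (N, K) classification logits; class K-1 = "no object".
    gt_boxes:    (M, 7) ground-truth boxes; gt_labels: (M,) long tensor.
    """
    # Matching cost: classification score plus L1 box distance
    prob = pred_logits.softmax(-1)                     # (N, K)
    cost_cls = -prob[:, gt_labels]                     # (N, M)
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M)
    cost = cost_cls + cost_box
    # One-to-one assignment between predictions and ground truths
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    row = torch.as_tensor(row)
    col = torch.as_tensor(col)
    # Matched predictions regress to their assigned ground truth...
    loss_box = torch.nn.functional.l1_loss(pred_boxes[row], gt_boxes[col])
    # ...and every prediction is classified; unmatched ones are
    # supervised toward the "no object" class
    target = torch.full((pred_logits.shape[0],), pred_logits.shape[1] - 1,
                        dtype=torch.long, device=pred_logits.device)
    target[row] = gt_labels[col]
    loss_cls = torch.nn.functional.cross_entropy(pred_logits, target)
    return loss_cls + loss_box
\end{verbatim}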