Association, which aims to link bounding boxes of the same identity across a video sequence, is a central component of multi-object tracking (MOT). Association modules, e.g., parametric networks, are usually trained on real video data. However, annotating person tracks in consecutive video frames is expensive, and real data are inflexible, offering limited opportunities to evaluate system performance w.r.t. changing tracking scenarios. In this paper, we study whether 3D synthetic data can replace real-world videos for association training. Specifically, we introduce a large-scale synthetic data engine named MOTX, in which the motion characteristics of cameras and objects are manually configured to resemble those in real-world datasets. We show that, compared with real data, association knowledge obtained from synthetic data achieves very similar performance on real-world test sets without domain adaptation techniques. We attribute this intriguing observation to two factors. First and foremost, 3D engines can simulate motion factors such as camera movement, camera view, and object movement well, so the simulated videos provide association modules with effective motion features. Second, experimental results show that the appearance domain gap hardly harms the learning of association knowledge. In addition, the strong customization ability of MOTX allows us to quantitatively assess the impact of motion factors on MOT, which brings new insights to the community.