Self-supervised multi-object trackers have tremendous potential because they enable learning from raw domain-specific data. However, their re-identification accuracy still falls short of their supervised counterparts. We hypothesize that this drawback results from formulating self-supervised objectives that are limited to single frames or frame pairs. Such formulations do not capture sufficient visual appearance variations to facilitate learning consistent re-identification features for autonomous driving when the frame rate is low or object dynamics are high. In this work, we propose a training objective that enables self-supervised learning of re-identification features from multiple sequential frames by enforcing consistent association scores across short and long timescales. We perform extensive evaluations demonstrating that re-identification features trained on longer sequences significantly reduce ID switches on standard autonomous driving datasets compared to existing self-supervised learning methods, which are limited to training on frame pairs. Using our proposed SubCo loss function, we set a new state of the art among self-supervised methods and even perform on par with fully supervised learning methods.