Tracking objects over long videos effectively means solving a spectrum of problems, from short-term association for un-occluded objects to long-term association for objects that are occluded and then reappear in the scene. Methods tackling these two tasks are often disjoint and crafted for specific scenarios, and top-performing approaches are often a mix of techniques, which yields engineering-heavy solutions that lack generality. In this work, we question the need for hybrid approaches and introduce SUSHI, a unified and scalable multi-object tracker. Our approach processes long clips by splitting them into a hierarchy of subclips, which enables high scalability. We leverage graph neural networks to process all levels of the hierarchy, which makes our model unified across temporal scales and highly general. As a result, we obtain significant improvements over state-of-the-art on four diverse datasets. Our code and models are available at bit.ly/sushi-mot.
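The hierarchical subclip processing described above can be sketched as follows. This is a minimal illustration of the splitting scheme only (not the paper's actual implementation, which also runs a graph neural network at each level); the function name and parameters are assumptions for illustration.

```python
def build_hierarchy(frames, clip_len=2):
    """Hypothetical sketch of hierarchical clip splitting: level 0 holds
    short subclips of `clip_len` consecutive frames; each higher level
    merges neighboring subclips, doubling the temporal span, until a
    single clip covers the whole video. In SUSHI, associations are made
    within subclips at each level with a GNN, so short-term and
    long-term association are handled by the same model at different
    temporal scales."""
    # Level 0: split the frame sequence into short, non-overlapping subclips.
    levels = [[frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]]
    # Higher levels: merge pairs of adjacent subclips until one clip remains.
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sum(prev[i:i + 2], []) for i in range(0, len(prev), 2)])
    return levels
```

For an 8-frame video, this yields three levels: four 2-frame subclips, two 4-frame subclips, and one 8-frame clip, so the number of levels grows only logarithmically with video length, which is what makes the approach scalable.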