Tracking objects over long videos effectively means solving a spectrum of problems, from short-term association for un-occluded objects to long-term association for objects that are occluded and then reappear in the scene. Methods tackling these two tasks are often disjoint and crafted for specific scenarios, and top-performing approaches are often a mix of techniques, which yields engineering-heavy solutions that lack generality. In this work, we question the need for hybrid approaches and introduce SUSHI, a unified and scalable multi-object tracker. Our approach processes long clips by splitting them into a hierarchy of subclips, which enables high scalability. We leverage graph neural networks to process all levels of the hierarchy, which makes our model unified across temporal scales and highly general. As a result, we obtain significant improvements over state-of-the-art on four diverse datasets. Our code and models will be made available.
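The hierarchical decomposition described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: it simply halves a frame range recursively until each leaf subclip is short enough for short-term association, with the exact split strategy and leaf length (`leaf_len`) chosen here as assumptions.

```python
# Hypothetical sketch of a subclip hierarchy (not the SUSHI codebase).
# Leaves cover short frame spans (short-term association); each internal
# node merges its children's tracks, so progressively longer-term
# associations are resolved higher in the hierarchy by the same GNN module.
def split_hierarchy(start, end, leaf_len=4):
    """Return a nested tree of (start, end) frame ranges over [start, end)."""
    if end - start <= leaf_len:
        return (start, end)  # leaf subclip
    mid = (start + end) // 2  # assumed: balanced binary split
    return ((start, end),
            split_hierarchy(start, mid, leaf_len),
            split_hierarchy(mid, end, leaf_len))

# e.g. a 16-frame clip yields two levels of internal nodes above 4-frame leaves
tree = split_hierarchy(0, 16, leaf_len=4)
```

Because every level of the tree is processed by the same graph neural network, the tracker stays unified across temporal scales while only ever operating on bounded-size subproblems, which is what gives the method its scalability.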