Tracking objects of interest in video is one of the most popular and widely applicable problems in computer vision. Over the years, however, a Cambrian explosion of use cases and benchmarks has fragmented the problem into a multitude of experimental setups. As a consequence, the literature has fragmented too, and novel approaches proposed by the community are now usually specialised to fit only one specific setup. To understand to what extent this specialisation is necessary, in this work we present UniTrack, a solution that addresses five different tasks within the same framework. UniTrack consists of a single, task-agnostic appearance model, which can be learned in a supervised or self-supervised fashion, and multiple ``heads'' that address individual tasks and do not require training. We show that most tracking tasks can be solved within this framework, and that the same appearance model can be used to obtain results competitive with specialised methods for most of the tasks considered. The framework also allows us to analyse appearance models obtained with the most recent self-supervised methods, thus extending their evaluation and comparison to a larger variety of important problems.
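The split between one shared appearance model and several training-free heads can be illustrated with a minimal sketch. This is not the UniTrack implementation: the function names (`association_head`, `propagation_head`, `toy_embed`), the greedy matching rule, and the identity embedding are all assumptions made for illustration. The only idea taken from the abstract is that every head reuses the same embedding function and has no learned parameters of its own.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Embed = Callable[[Sequence[float]], Vector]  # hypothetical appearance-model interface


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def association_head(embed: Embed, tracks, detections) -> List[int]:
    """Training-free head (assumed design): greedily assign each detection
    to the most similar unused track; -1 means no track was matched."""
    track_feats = [embed(t) for t in tracks]
    used = set()
    assignment = []
    for det in detections:
        feat = embed(det)
        best, best_sim = -1, 0.0  # require strictly positive similarity
        for i, tf in enumerate(track_feats):
            if i in used:
                continue
            sim = cosine(feat, tf)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0:
            used.add(best)
        assignment.append(best)
    return assignment


def propagation_head(embed: Embed, ref_items, ref_labels, queries) -> list:
    """Training-free head (assumed design): copy to each query the label
    of its most similar reference item."""
    ref_feats = [embed(r) for r in ref_items]
    labels = []
    for q in queries:
        feat = embed(q)
        sims = [cosine(feat, rf) for rf in ref_feats]
        labels.append(ref_labels[sims.index(max(sims))])
    return labels


def toy_embed(x: Sequence[float]) -> Vector:
    """Stand-in for a learned appearance model: the identity mapping."""
    return list(x)
```

Both heads take the same `embed` argument, so replacing `toy_embed` with a different appearance model changes the behaviour of every task without retraining any head. For example, `association_head(toy_embed, [(1, 0), (0, 1)], [(0.1, 0.9), (0.9, 0.1)])` matches each detection to the track it points towards.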