The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two.
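The abstract only describes the "targets as queries" idea at a high level. The following minimal PyTorch sketch illustrates how such a design could unify these tasks: every task reduces to supplying a set of query vectors, and a single shared decoder turns queries plus video features into per-target masks. All class names, shapes, and the query sources here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TargetDecoder(nn.Module):
    """Hypothetical task-agnostic decoder: refines target queries against
    video features and predicts one mask per query, regardless of which
    task defined the queries."""

    def __init__(self, dim: int = 256, num_layers: int = 6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, queries: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # queries: (B, Q, C) -- target definitions supplied by the task
        # feats:   (B, T*H*W, C) -- flattened per-pixel video features
        q = self.decoder(queries, feats)  # cross-attend queries to video
        # Dot product of refined queries with pixel features -> mask logits
        return torch.einsum("bqc,bnc->bqn", self.mask_embed(q), feats)

# Hot-swapping tasks: only the query source changes; the decoder is shared.
B, T, H, W, C, Q, K = 1, 4, 32, 32, 256, 20, 3
feats = torch.randn(B, T * H * W, C)
decoder = TargetDecoder(dim=C)

# e.g. VIS/VPS: learned, dataset-level semantic/instance queries
vis_queries = torch.randn(1, Q, C).expand(B, -1, -1)
# e.g. VOS/PET: queries encoded from first-frame masks or point annotations
# (stand-in tensor here; an object encoder would produce these)
vos_queries = torch.randn(B, K, C)

masks_vis = decoder(vis_queries, feats)  # (B, Q, T*H*W) mask logits
masks_vos = decoder(vos_queries, feats)  # (B, K, T*H*W) mask logits
```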