Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes. Most approaches only exploit the temporal dimension to address the association problem, while relying on single-frame predictions for the segmentation mask itself. We propose the Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation. PCAN first distills a space-time memory into a set of prototypes and then employs cross-attention to retrieve rich information from past frames. To segment each object, PCAN adopts a prototypical appearance module to learn a set of contrastive foreground and background prototypes, which are then propagated over time. Extensive experiments demonstrate that PCAN outperforms the current video instance tracking and segmentation competition winners on both the Youtube-VIS and BDD100K datasets, and that it is effective in both one-stage and two-stage segmentation frameworks. Code will be available at http://vis.xyz/pub/pcan.
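To make the core mechanism concrete, the following PyTorch sketch illustrates one plausible reading of the two steps described above: distilling a space-time memory into a small set of prototypes (here via a few soft k-means/EM iterations) and cross-attending current-frame features against those prototypes. The function names, the initialization scheme, and all hyperparameters are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def distill_prototypes(memory, num_prototypes=64, em_iters=3):
    """Distill a space-time memory into prototypes via soft k-means (EM).

    memory: (N, C) features gathered from past frames (N = T*H*W locations).
    Returns a (num_prototypes, C) tensor of prototype features.
    NOTE: illustrative sketch; the initialization and iteration count
    are assumptions, not taken from the paper.
    """
    # Initialize prototypes from a random subset of memory features.
    idx = torch.randperm(memory.size(0))[:num_prototypes]
    prototypes = memory[idx].clone()
    for _ in range(em_iters):
        # E-step: softly assign each memory feature to the prototypes.
        sim = F.normalize(memory, dim=1) @ F.normalize(prototypes, dim=1).t()
        assign = sim.softmax(dim=1)                                # (N, K)
        # M-step: update prototypes as assignment-weighted means.
        weights = assign.sum(dim=0, keepdim=True).t() + 1e-6       # (K, 1)
        prototypes = (assign.t() @ memory) / weights               # (K, C)
    return prototypes

def prototypical_cross_attention(query, prototypes):
    """Cross-attend current-frame features against the distilled prototypes.

    query: (H*W, C) current-frame features; prototypes: (K, C).
    Returns (H*W, C) features enriched with spatio-temporal context.
    """
    scale = query.size(1) ** 0.5
    attn = (query @ prototypes.t()) / scale                        # (H*W, K)
    attn = attn.softmax(dim=1)
    return attn @ prototypes                                       # (H*W, C)
```

One motivation for this design is efficiency: attending to K prototypes instead of every memory location reduces the attention cost from O(N_query * N_memory) to O(N_query * K), which is what makes reading from a long space-time memory tractable for online tracking.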