Depth-aware video panoptic segmentation is a promising approach to camera-based scene understanding. However, current state-of-the-art methods require costly video annotations and use a complex training pipeline compared to their image-based equivalents. In this paper, we present a new approach, Unified Perception, that achieves state-of-the-art performance without requiring video-based training. Our method employs a simple two-stage cascaded tracking algorithm that (re)uses object embeddings computed in an image-based network. Experimental results on the Cityscapes-DVPS dataset demonstrate that our method achieves an overall DVPQ of 57.1, surpassing state-of-the-art methods. Furthermore, we show that our tracking strategies are effective for long-term object association on KITTI-STEP, achieving an STQ of 59.1 and exceeding the performance of state-of-the-art methods that employ the same backbone network.