This paper presents Video K-Net, a simple, strong, and unified framework for fully end-to-end video panoptic segmentation. The method is built upon K-Net, a method that unifies image segmentation via a group of learnable kernels. We observe that these learnable kernels from K-Net, which encode object appearances and contexts, can naturally associate identical instances across video frames. Motivated by this observation, Video K-Net learns to simultaneously segment and track "things" and "stuff" in a video with simple kernel-based appearance modeling and cross-temporal kernel interaction. Despite its simplicity, it achieves state-of-the-art video panoptic segmentation results on Cityscapes-VPS, KITTI-STEP, and VIPSeg without bells and whistles. In particular, on KITTI-STEP, this simple method yields almost 12% relative improvement over previous methods. On VIPSeg, Video K-Net achieves almost 15% relative improvement, reaching 39.8% VPQ. We also validate its generalization on video semantic segmentation, where it improves various baselines by 2% on the VSPW dataset. Moreover, we extend K-Net into a clip-level video framework for video instance segmentation, obtaining 40.5% mAP with a ResNet50 backbone and 54.1% mAP with Swin-base on the YouTube-VIS 2019 validation set. We hope this simple yet effective method can serve as a new, flexible baseline in unified video segmentation design. Both code and models are released at https://github.com/lxtGH/Video-K-Net.
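The core observation above is that learned kernels encode instance appearance, so identical instances can be associated across frames by comparing kernel embeddings. A minimal sketch of this idea, using cosine similarity and a greedy one-to-one assignment (a hypothetical simplification; the actual Video K-Net learns the association end-to-end with cross-temporal kernel interaction):

```python
import numpy as np

def associate_kernels(prev_kernels: np.ndarray, cur_kernels: np.ndarray) -> np.ndarray:
    """Match each current-frame kernel to a previous-frame kernel by
    cosine similarity, greedily and one-to-one. Inputs are arrays of
    shape (num_kernels, dim); returns, for each current kernel, the
    index of its matched previous kernel (-1 if no match remains)."""
    # L2-normalize kernel embeddings so the dot product is cosine similarity.
    p = prev_kernels / np.linalg.norm(prev_kernels, axis=1, keepdims=True)
    c = cur_kernels / np.linalg.norm(cur_kernels, axis=1, keepdims=True)
    sim = c @ p.T  # (num_cur, num_prev) similarity matrix

    ids = -np.ones(len(cur_kernels), dtype=int)
    used = set()
    # Assign the most confident current kernels first.
    for i in np.argsort(-sim.max(axis=1)):
        for j in np.argsort(-sim[i]):
            if j not in used:
                ids[i] = j
                used.add(j)
                break
    return ids
```

In practice a learned matching head or bipartite (Hungarian) assignment would replace the greedy loop; this sketch only illustrates why appearance-encoding kernels make cross-frame tracking nearly free.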