The task of assigning semantic classes and track identities to every pixel in a video is called video panoptic segmentation. Our work is the first to target this task in a real-world setting requiring dense interpretation in both the spatial and temporal domains. As the ground truth for this task is difficult and expensive to obtain, existing datasets are either constructed synthetically or only sparsely annotated within short video clips. To overcome this, we introduce a new benchmark encompassing two datasets, KITTI-STEP and MOTChallenge-STEP. The datasets contain long video sequences, providing challenging examples and a test-bed for studying long-term pixel-precise segmentation and tracking under real-world conditions. We further propose a novel evaluation metric, Segmentation and Tracking Quality (STQ), that fairly balances the semantic and tracking aspects of this task and is more appropriate for evaluating sequences of arbitrary length. Finally, we provide several baselines to evaluate the status of existing methods on this new, challenging benchmark. We have made our datasets, metric, benchmark servers, and baselines publicly available, and hope this will inspire future research.
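To make the balancing idea behind STQ concrete, the sketch below shows one way such a metric can be assembled: a segmentation-quality term computed as a class-level IoU over semantic labels, combined with an association-quality term via a geometric mean so that neither aspect can be ignored. This is a simplified illustration, not the official STQ implementation; the function names, the reduced AQ input, and the per-frame handling are assumptions for demonstration.

```python
import numpy as np

def segmentation_quality(pred_sem, gt_sem, num_classes):
    """SQ term (illustrative): mean IoU over semantic classes that
    appear in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        p = (pred_sem == c)
        g = (gt_sem == c)
        union = np.logical_or(p, g).sum()
        if union > 0:
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

def stq(sq, aq):
    """Combine segmentation quality (SQ) and association quality (AQ)
    with a geometric mean: a low score on either side drags the
    overall score down, so neither aspect dominates."""
    return float(np.sqrt(sq * aq))

# Toy example: two 2x2 frames with 3 semantic classes.
gt = np.array([[0, 1], [2, 2]])
pred = np.array([[0, 1], [2, 1]])  # one pixel of class 2 mislabeled
sq = segmentation_quality(pred, gt, num_classes=3)
score = stq(sq, aq=1.0)  # assume perfect association for the toy case
```

Here a perfect prediction yields SQ = 1, and the geometric mean ensures, for example, that perfect segmentation with poor tracking still receives a low overall score.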