Depth-aware Video Panoptic Segmentation (DVPS) is a new and challenging vision problem that aims to predict panoptic segmentation and depth in a video simultaneously. Previous work solves this task by extending an existing panoptic segmentation method with an extra dense depth prediction head and an instance tracking head. However, the relationship between depth and panoptic segmentation is not well explored -- simply combining existing methods leads to competition between the tasks and requires careful loss-weight balancing. In this paper, we present PolyphonicFormer, a vision transformer that unifies these sub-tasks under the DVPS task and leads to more robust results. Our principal insight is that depth can be harmonized with panoptic segmentation through our proposed new paradigm of predicting instance-level depth maps with object queries. The relationship between the two tasks is then explored via query-based learning. Our experiments demonstrate the benefits of this design for both depth estimation and panoptic segmentation. Since each thing query also encodes instance-wise information, it is natural to perform tracking directly through appearance learning. Our method achieves state-of-the-art results on two DVPS datasets (Semantic KITTI, Cityscapes), and ranks 1st on the ICCV-2021 BMTT Challenge video + depth track. Code is available at https://github.com/HarborYuan/PolyphonicFormer .
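The core idea above — each object query predicting both an instance mask and an instance-level depth map, which are then composed into a full depth map — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: all shapes, the linear depth projection, and the softmax-weighted composition are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: N object queries, C channels, an H x W feature map.
N, C, H, W = 4, 16, 8, 8
queries = rng.normal(size=(N, C))    # one learned embedding per object query
feats = rng.normal(size=(C, H, W))   # shared per-pixel features

# Each query predicts an instance mask via dot product with pixel features.
mask_logits = np.einsum('nc,chw->nhw', queries, feats)  # (N, H, W)

# Each query also predicts an instance-level depth map from the same
# features, here through a second (hypothetical) linear projection.
depth_proj = rng.normal(size=(C, C))
inst_depth = np.einsum('nc,chw->nhw', queries @ depth_proj, feats)  # (N, H, W)

# Compose the full depth map: a per-pixel softmax over the instance masks
# weights each instance's depth prediction, so depth and segmentation share
# the same query-based representation instead of competing heads.
weights = np.exp(mask_logits - mask_logits.max(axis=0, keepdims=True))
weights /= weights.sum(axis=0, keepdims=True)
full_depth = (weights * inst_depth).sum(axis=0)  # (H, W)
```

In this sketch, improving a query's mask automatically refines where its depth prediction applies, which is one intuition behind unifying the two tasks.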