Depth-aware video panoptic segmentation tackles the inverse projection problem of restoring panoptic 3D point clouds from video sequences, where the 3D points are augmented with semantic classes and temporally consistent instance identifiers. We propose a novel solution with a multi-task network that performs monocular depth estimation and video panoptic segmentation. Since acquiring ground-truth labels for both depth and image segmentation is costly, we leverage unlabeled video sequences through self-supervised monocular depth estimation and semi-supervised learning from pseudo-labels for video panoptic segmentation. To further improve the depth prediction, we introduce panoptic-guided depth losses and a novel panoptic masking scheme for moving objects to avoid corrupting the training signal. Extensive experiments on the Cityscapes-DVPS and SemKITTI-DVPS datasets demonstrate that our model with the proposed improvements achieves competitive results and fast inference speed.