Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of prediction. In this paper, we propose a Spatial-Temporal Semantic Consistency method to capture class-exclusive context information. Specifically, we design a spatial-temporal consistency loss to enforce semantic consistency in both the spatial and temporal dimensions. In addition, we adopt a pseudo-labeling strategy to enrich the training dataset. Our method achieves 59.84% and 58.85% mIoU on the development (test part 1) and test sets of VSPW, respectively, and won 1st place in the VSPW Challenge at ICCV 2021.
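To make the idea concrete, below is a minimal PyTorch sketch of one plausible form of such a consistency objective: a temporal term that penalizes per-pixel feature drift between adjacent frames, and a spatial term that pulls features of same-class pixels toward their class centroid. The function names, the cosine-similarity formulation, and the centroid construction are illustrative assumptions for exposition, not the paper's exact loss definition.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(feat_t, feat_t1):
    """Illustrative temporal term (an assumption, not the paper's exact
    formulation): penalize per-pixel feature changes between two
    adjacent frames of the same clip.

    feat_t, feat_t1: (N, C, H, W) feature maps for frames t and t+1.
    """
    # Cosine distance per pixel: 1 - similarity is small when the
    # features of corresponding pixels agree across time.
    sim = F.cosine_similarity(feat_t, feat_t1, dim=1)  # (N, H, W)
    return (1.0 - sim).mean()

def spatial_consistency_loss(feat, labels, ignore_index=255):
    """Illustrative spatial term (an assumption): pull the features of
    pixels sharing a class toward that class's centroid, encouraging
    class-exclusive context within each frame.

    feat:   (N, C, H, W) feature maps.
    labels: (N, H, W) ground-truth class ids (downsampled to H x W).
    """
    n, c, h, w = feat.shape
    flat = feat.permute(0, 2, 3, 1).reshape(-1, c)  # (N*H*W, C)
    lab = labels.reshape(-1)                        # (N*H*W,)
    loss, count = feat.new_zeros(()), 0
    for cls in lab.unique():
        if cls == ignore_index:
            continue
        cls_feat = flat[lab == cls]
        center = cls_feat.mean(dim=0, keepdim=True)  # (1, C) centroid
        loss = loss + (1.0 - F.cosine_similarity(cls_feat, center)).mean()
        count += 1
    return loss / max(count, 1)
```

In training, such terms would typically be added to the standard per-frame cross-entropy loss with scalar weights, e.g. `loss = ce + a * spatial_consistency_loss(feat, labels) + b * temporal_consistency_loss(feat_t, feat_t1)`, where `a` and `b` are hypothetical balancing hyperparameters.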