Unsupervised Domain Adaptation (UDA) for semantic segmentation has gained immense popularity since it can transfer knowledge from simulation to reality (Sim2Real), largely eliminating the laborious per-pixel labeling effort on real data. In this work, we present a new video extension of this task, namely Unsupervised Domain Adaptation for Video Semantic Segmentation. As large-scale video labels have become easy to obtain through simulation, we believe that maximizing Sim2Real knowledge transferability is one of the promising directions for resolving the fundamental data-hungry issue in video. To tackle this new problem, we present a novel two-phase adaptation scheme. In the first step, we exhaustively distill source-domain knowledge using supervised loss functions. Simultaneously, video adversarial training (VAT) is employed to align the features from source to target using video context. In the second step, we apply video self-training (VST), focusing only on the target data. To construct robust pseudo labels, we exploit the temporal information in the video, which has rarely been explored in previous image-based self-training approaches. We set strong baseline scores on the 'VIPER to Cityscapes-VPS' adaptation scenario. We show that our proposals significantly outperform previous image-based UDA methods on both image-level (mIoU) and video-level (VPQ) evaluation metrics.
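To make the two-phase scheme concrete, below is a minimal PyTorch-style sketch. It assumes a simple per-frame segmentation network, an output-space discriminator, a temporal-average pseudo-label fusion rule, and illustrative loss weights and confidence thresholds; none of these choices are taken from the paper's actual implementation.

```python
# A minimal sketch of the two-phase adaptation scheme described above.
# All architectures, hyperparameters, and the pseudo-label fusion rule
# are illustrative assumptions, not the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES, IGNORE = 19, 255

# Placeholder per-frame segmentation network, applied to every frame of a clip.
seg_net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, NUM_CLASSES, 1),
)
# Placeholder output-space discriminator (source vs. target predictions).
disc = nn.Sequential(
    nn.Conv2d(NUM_CLASSES, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 4, stride=2, padding=1),
)
opt_seg = torch.optim.SGD(seg_net.parameters(), lr=2.5e-4, momentum=0.9)
opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss(ignore_index=IGNORE)
bce = nn.BCEWithLogitsLoss()


def phase1_step(src_clip, src_labels, tgt_clip, lambda_adv=1e-3):
    """Phase 1: supervised source loss + video adversarial training (VAT)."""
    src_logits = seg_net(src_clip.flatten(0, 1))   # (B*T, C, H, W)
    tgt_logits = seg_net(tgt_clip.flatten(0, 1))

    # Distill source-domain knowledge with a supervised loss on labeled frames.
    loss_seg = ce(src_logits, src_labels.flatten(0, 1))

    # Adversarial alignment: push target predictions to look source-like.
    d_tgt = disc(F.softmax(tgt_logits, dim=1))
    loss_adv = bce(d_tgt, torch.ones_like(d_tgt))

    opt_seg.zero_grad()
    (loss_seg + lambda_adv * loss_adv).backward()
    opt_seg.step()

    # Train the discriminator to tell source and target predictions apart.
    d_src = disc(F.softmax(src_logits.detach(), dim=1))
    d_tgt = disc(F.softmax(tgt_logits.detach(), dim=1))
    loss_d = bce(d_src, torch.ones_like(d_src)) + bce(d_tgt, torch.zeros_like(d_tgt))
    opt_disc.zero_grad()
    loss_d.backward()
    opt_disc.step()


@torch.no_grad()
def temporal_pseudo_labels(tgt_clip, conf_thresh=0.9):
    """Fuse per-frame predictions over the clip (a simple temporal average
    as a stand-in) and keep only confident pixels as pseudo labels."""
    probs = F.softmax(seg_net(tgt_clip.flatten(0, 1)), dim=1)
    probs = probs.unflatten(0, tgt_clip.shape[:2]).mean(dim=1)  # (B, C, H, W)
    conf, labels = probs.max(dim=1)
    labels[conf < conf_thresh] = IGNORE
    return labels


def phase2_step(tgt_clip):
    """Phase 2: video self-training (VST) on target data only."""
    pseudo = temporal_pseudo_labels(tgt_clip)                   # (B, H, W)
    logits = seg_net(tgt_clip.flatten(0, 1))                    # (B*T, C, H, W)
    # Supervise every frame of the clip with the fused pseudo label.
    loss_st = ce(logits, pseudo.repeat_interleave(tgt_clip.shape[1], dim=0))
    opt_seg.zero_grad()
    loss_st.backward()
    opt_seg.step()
```

As described in the abstract, the two phases run sequentially: the model is first trained with `phase1_step` over labeled source clips and unlabeled target clips, and is then fine-tuned with `phase2_step` using pseudo labels generated on the target data alone.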