Networks trained for domain adaptation are prone to bias toward easy-to-transfer classes. Since ground-truth labels on the target domain are unavailable during training, this bias leads to skewed predictions that fail to recognize hard-to-transfer classes. To address this problem, we propose Cross-domain Moving Object Mixing (CMOM), which cuts several objects, including hard-to-transfer classes, from the source-domain video clip and pastes them into the target-domain video clip. Unlike image-level domain adaptation, the temporal context must be maintained when mixing moving objects across two different videos. Therefore, we design CMOM to mix consecutive video frames, so that unrealistic movements do not occur. We additionally propose Feature Alignment with Temporal Context (FATC) to enhance target-domain feature discriminability. FATC exploits robust source-domain features, which are trained with ground-truth labels, to learn discriminative target-domain features in an unsupervised manner, filtering out unreliable predictions with temporal consensus. We demonstrate the effectiveness of the proposed approaches through extensive experiments. In particular, our model reaches 53.81% mIoU on the VIPER to Cityscapes-Seq benchmark and 56.31% mIoU on the SYNTHIA-Seq to Cityscapes-Seq benchmark, surpassing state-of-the-art methods by large margins.
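The cut-and-paste step of CMOM can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes source and target clips of equal shape, uses the source segmentation labels as the cut mask, and applies the same per-frame masks across the consecutive frames of one clip so the pasted objects keep their original motion. The function name `cmom_mix`, the ignore index 255, and the class-selection argument are all hypothetical choices for this example.

```python
import numpy as np

def cmom_mix(src_frames, src_labels, tgt_frames, mix_classes):
    """Sketch of Cross-domain Moving Object Mixing (CMOM).

    src_frames, tgt_frames: (T, H, W, 3) uint8 clips of equal shape
    src_labels: (T, H, W) integer semantic labels for the source clip
    mix_classes: class ids to cut from source and paste into target
                 (e.g. hard-to-transfer classes)
    Returns the mixed clip and its partial label map (255 = ignore;
    in practice the non-pasted pixels would carry target pseudo-labels).
    """
    mixed = tgt_frames.copy()
    mixed_labels = np.full(src_labels.shape, fill_value=255, dtype=np.int64)
    for t in range(src_frames.shape[0]):
        # Per-frame mask of the selected classes. Cutting from the same
        # source clip at every timestep t keeps the pasted objects'
        # movement temporally consistent across consecutive frames.
        mask = np.isin(src_labels[t], list(mix_classes))
        mixed[t][mask] = src_frames[t][mask]
        mixed_labels[t][mask] = src_labels[t][mask]
    return mixed, mixed_labels
```

In this sketch the temporal constraint comes for free: because the masks are taken from consecutive frames of one source clip, an object pasted into the target clip moves exactly as it did in the source video, avoiding the unrealistic motion that frame-independent image-level mixing would produce.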