Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer supervision from a labeled source domain to an unlabeled target domain. Most existing UDA-SS works consider images, while recent attempts have extended to videos by modeling the temporal dimension. Although the two lines of research share the central challenge of overcoming the underlying domain distribution shift, they have been studied largely independently, resulting in fragmented insights, a lack of holistic understanding, and missed opportunities for cross-pollination of ideas. This fragmentation prevents the unification of methods, leading to redundant effort and suboptimal knowledge transfer across image and video domains. Motivated by this observation, we advocate unifying the study of UDA-SS across video and image scenarios, enabling a more comprehensive understanding, synergistic advances, and efficient knowledge sharing. To that end, we explore unified UDA-SS from a general data-augmentation perspective, which serves as a unifying conceptual framework, improves generalization, and fosters cross-pollination of ideas, ultimately advancing the overall progress and practical impact of this field. Specifically, we propose a Quad-directional Mixup (QuadMix) method, which addresses distinct point attributes and feature inconsistencies through four-directional paths for intra- and inter-domain mixing in the feature space. To handle temporal shifts in videos, we further incorporate optical flow-guided feature aggregation across the spatial and temporal dimensions for fine-grained domain alignment. Extensive experiments show that our method outperforms state-of-the-art methods by large margins on four challenging UDA-SS benchmarks. Our source code and models will be released at https://github.com/ZHE-SAPI/UDASS.
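To make the four-directional mixing idea concrete, the following is a minimal sketch, not the paper's actual implementation: it assumes class-masked feature mixing in the style of ClassMix, with two intra-domain paths (source-to-source, target-to-target) and two inter-domain paths (source-to-target, target-to-source). The function names (`masked_mix`, `quad_mix`) and the mask-generation step are hypothetical placeholders.

```python
import torch

def masked_mix(feat_a, feat_b, mask):
    # Paste regions of feat_a (selected by mask) onto feat_b in feature space.
    # mask: (B, 1, H, W) binary tensor, broadcast over the channel dimension.
    return mask * feat_a + (1 - mask) * feat_b

def quad_mix(feat_src, feat_tgt, mask_src, mask_tgt):
    """Hypothetical four-directional feature mixing (sketch only).

    feat_src, feat_tgt: (B, C, H, W) source/target feature maps.
    mask_src, mask_tgt: (B, 1, H, W) binary masks, assumed to be sampled
    from (pseudo-)label class regions as in ClassMix.
    """
    # Shuffle within the batch to obtain intra-domain mixing partners.
    perm = torch.randperm(feat_src.size(0))
    return {
        "intra_src":  masked_mix(feat_src, feat_src[perm], mask_src),
        "intra_tgt":  masked_mix(feat_tgt, feat_tgt[perm], mask_tgt),
        "src_to_tgt": masked_mix(feat_src, feat_tgt, mask_src),
        "tgt_to_src": masked_mix(feat_tgt, feat_src, mask_tgt),
    }
```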
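Similarly, a minimal sketch of what optical flow-guided temporal feature aggregation can look like, under stated assumptions: backward optical flow in pixel units warps the previous frame's features into the current frame before fusion. The names `flow_warp` and `aggregate` are hypothetical, and the fixed convex combination stands in for what is presumably a learned fusion module in the actual method.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat_prev, flow):
    """Warp features from frame t-1 to frame t using backward optical flow.

    feat_prev: (B, C, H, W); flow: (B, 2, H, W) in pixels, where
    flow[:, 0] is horizontal and flow[:, 1] is vertical displacement.
    """
    b, _, h, w = feat_prev.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat_prev.device, dtype=feat_prev.dtype),
        torch.arange(w, device=feat_prev.device, dtype=feat_prev.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # (B, H, W)
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize to [-1, 1] as required by grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(feat_prev, grid, align_corners=True)

def aggregate(feat_cur, feat_prev, flow, alpha=0.5):
    # Fuse current features with flow-aligned previous features; a simple
    # convex combination stands in for a learned fusion module.
    return alpha * feat_cur + (1 - alpha) * flow_warp(feat_prev, flow)
```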