Domain Generalization (DG) aims to generalize a model trained on multiple source domains to an unseen target domain. The source domains typically require precise annotations, which can be cumbersome or even infeasible to obtain in practice due to the vast amount of data involved. Web data, however, offers access to large amounts of unlabeled data with rich style information, which can be leveraged to improve DG. From this perspective, we introduce a novel paradigm of DG, termed Semi-Supervised Domain Generalization (SSDG), to explore how the labeled and unlabeled source domains can interact, and establish two settings: close-set and open-set SSDG. The close-set SSDG is based on existing public DG datasets, while the open-set SSDG, built on newly collected web-crawled datasets, presents a novel yet realistic challenge that pushes the limits of current technologies. A natural approach to SSDG is to transfer knowledge from labeled data to unlabeled data via pseudo labeling, and then train the model on both labeled and pseudo-labeled data for generalization. Since domain-oriented pseudo labeling and out-of-domain generalization have conflicting goals, we develop a pseudo labeling phase and a generalization phase independently for SSDG. Unfortunately, due to the large domain gap, the pseudo labels produced in the pseudo labeling phase inevitably contain noise, which has a negative effect on the subsequent generalization phase. Therefore, to improve the quality of pseudo labels and further enhance generalizability, we propose a cyclic learning framework that encourages positive feedback between these two phases, utilizing an evolving intermediate domain that bridges the labeled and unlabeled domains in a curriculum learning manner...
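The pseudo labeling step mentioned in the abstract can be illustrated with a generic confidence-thresholding sketch. This is not the cyclic framework proposed in the paper; it is a common baseline shown only to make the idea concrete. The function names `softmax` and `pseudo_label`, the threshold value of 0.9, and the toy logits are all assumptions introduced for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(logits_batch, threshold=0.9):
    """Keep only unlabeled samples whose top predicted probability
    reaches `threshold` (a hypothetical hyperparameter).

    Returns a list of (sample_index, predicted_class, confidence).
    """
    selected = []
    for i, logits in enumerate(logits_batch):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold:
            selected.append((i, probs.index(conf), conf))
    return selected


# Toy logits standing in for a model's outputs on unlabeled samples.
unlabeled_logits = [
    [4.0, 0.0, 0.0],  # confident prediction for class 0
    [0.2, 0.1, 0.0],  # ambiguous, should be rejected
    [0.0, 0.0, 5.0],  # confident prediction for class 2
]

picked = pseudo_label(unlabeled_logits, threshold=0.9)
for index, label, conf in picked:
    print(f"sample {index}: pseudo label {label} (confidence {conf:.3f})")
```

Samples that pass the threshold would be added to the training set alongside the labeled data. Under a large domain gap, a confident prediction can still be wrong, which is the label noise problem the abstract describes.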