For models to generalize to unseen domains (a.k.a. domain generalization), it is crucial to learn feature representations that are domain-agnostic and capture the underlying semantics that make up an object category. Recent advances in weakly supervised vision-language models, which learn holistic representations from cheap, noisy text annotations, have demonstrated semantic understanding by capturing object characteristics that generalize across domains. However, when multiple source domains are involved, the cost of curating textual annotations for every image in the dataset can grow several-fold with the number of domains. This makes the process tedious and infeasible, preventing us from directly using these supervised vision-language approaches to achieve the best generalization on an unseen domain. Motivated by this, we study how multimodal information from existing pre-trained multimodal networks can be leveraged in an "intrinsic" way to make systems generalize to unseen domains. To this end, we propose IntriNsic multimodality for DomaIn GeneralizatiOn (INDIGO), a simple and elegant way of leveraging the intrinsic modality present in these pre-trained multimodal networks along with the visual modality to enhance generalization to unseen domains at test time. We experiment on several domain generalization settings (ClosedDG, OpenDG, and Limited sources) and show state-of-the-art generalization performance on unseen domains. Further, we provide a thorough analysis to develop a holistic understanding of INDIGO.
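To make the core idea concrete, the sketch below shows one plausible way to fuse the "intrinsic" modality of a frozen pre-trained multimodal encoder with a standard visual backbone. This is a minimal, hypothetical illustration under our own assumptions (the class name `IntrinsicFusion`, the feature dimensions, and the concatenation-plus-MLP fusion are all illustrative), not the exact INDIGO architecture.

```python
# Hypothetical sketch: combining a trainable visual stream with an
# "intrinsic" stream taken from a frozen pre-trained multimodal network
# (e.g., features from CLIP's image tower), so no per-image text
# annotations are needed at training time.
import torch
import torch.nn as nn

class IntrinsicFusion(nn.Module):
    def __init__(self, visual_dim=768, intrinsic_dim=512, num_classes=65):
        super().__init__()
        # Project both streams into a shared space, then classify
        # from the concatenated representation.
        self.visual_proj = nn.Linear(visual_dim, 256)
        self.intrinsic_proj = nn.Linear(intrinsic_dim, 256)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * 256, num_classes),
        )

    def forward(self, visual_feat, intrinsic_feat):
        # visual_feat: output of a trainable visual backbone
        # intrinsic_feat: output of the frozen multimodal encoder,
        # supplying semantics learned from web-scale image-text data
        v = self.visual_proj(visual_feat)
        m = self.intrinsic_proj(intrinsic_feat)
        return self.classifier(torch.cat([v, m], dim=-1))

# Toy usage with random tensors standing in for real backbone outputs.
model = IntrinsicFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 65])
```

The design choice worth noting is that the intrinsic stream stays frozen: its semantics come for free from large-scale pre-training, which is what lets this setup sidestep the per-image annotation cost discussed above.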