Reducing the representational discrepancy between source and target domains is key to maximizing model generalization. In this work, we advocate leveraging natural language supervision for the domain generalization task. We introduce two modules to ground visual representations with texts containing typical human reasoning: (1) a Visual and Textual Joint Embedder and (2) a Textual Explanation Generator. The former learns an image-text joint embedding space in which high-level class-discriminative information can be grounded into the model. The latter leverages an explainable model and generates explanations justifying the rationale behind its decisions. To the best of our knowledge, this is the first work to leverage a vision-and-language cross-modality approach for the domain generalization task. Our experiments with the newly created CUB-DG benchmark dataset demonstrate that cross-modality supervision can be successfully used to ground domain-invariant visual representations and improve model generalization. Furthermore, on the large-scale DomainBed benchmark, our proposed method achieves state-of-the-art results and ranks 1st in average performance across five multi-domain datasets. The dataset and code are available at https://github.com/mswzeus/GVRT.
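To make the image-text joint embedding idea concrete, below is a minimal sketch of how such a module could be implemented, assuming pooled CNN image features, sentence-level text embeddings, and a symmetric contrastive (InfoNCE-style) alignment loss. The encoder dimensions, temperature, and loss formulation here are illustrative assumptions, not the exact GVRT implementation described in the paper.

```python
# Minimal sketch of an image-text joint embedder (illustrative, not the
# authors' exact method): both modalities are projected into a shared space
# and aligned with a symmetric contrastive loss over matched pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointEmbedder(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512, temperature=0.07):
        super().__init__()
        # Linear projection heads mapping each modality into the shared space.
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        self.temperature = temperature

    def forward(self, img_feats, txt_feats):
        # L2-normalize so that dot products become cosine similarities.
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return z_img, z_txt

    def alignment_loss(self, z_img, z_txt):
        # Symmetric cross-entropy over the image-text similarity matrix;
        # matched image-text pairs (the diagonal) serve as positives.
        logits = z_img @ z_txt.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_i2t + loss_t2i)


# Usage with random tensors standing in for encoder outputs.
embedder = JointEmbedder()
img_feats = torch.randn(8, 2048)   # e.g., pooled CNN features
txt_feats = torch.randn(8, 768)    # e.g., embeddings of textual descriptions
z_img, z_txt = embedder(img_feats, txt_feats)
loss = embedder.alignment_loss(z_img, z_txt)
```

A shared, normalized embedding space of this kind is one common way to inject class-discriminative textual cues into the visual backbone, since the image encoder is pushed toward features that remain predictive regardless of domain-specific appearance.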