Domain adaptation has been extensively investigated in computer vision, but it still requires access to target images at train time, which may be intractable in some uncommon conditions. In this paper, we propose the task of `Prompt-driven Zero-shot Domain Adaptation', in which we adapt a model trained on a source domain using only a single general textual description of the target domain, i.e., a prompt. First, we leverage a pretrained contrastive vision-language model (CLIP) to optimize affine transformations of source features, steering them towards target text embeddings while preserving their content and semantics. Second, we show that these augmented features can be used to perform zero-shot domain adaptation for semantic segmentation. Experiments demonstrate that our method significantly outperforms CLIP-based style transfer baselines on several datasets for the downstream task at hand. Our prompt-driven approach even outperforms one-shot unsupervised domain adaptation on some datasets and gives comparable results on others. Our code is available at https://github.com/astra-vision/PODA.
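To make the first step more concrete, below is a minimal, hedged sketch of how prompt-driven affine stylization of source features could be set up. It assumes a frozen feature extractor has already produced low-level features `feat`, that `text_emb` is the CLIP embedding of the target-domain prompt, and that `project` is a placeholder callable mapping stylized features into CLIP's joint embedding space (e.g., the remainder of a CLIP image encoder). These names and the exact optimization settings are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: optimize per-channel affine statistics (mu, sigma) of source
# features so their CLIP-space embedding aligns with a target-prompt embedding,
# while content is preserved by keeping the normalized feature map unchanged.
import torch
import torch.nn.functional as F


def stylize(feat, mu, sigma, eps=1e-5):
    """AdaIN-style transform: replace the per-channel statistics of
    `feat` (B, C, H, W) with learnable target statistics (mu, sigma)."""
    mean = feat.mean(dim=(2, 3), keepdim=True)
    std = feat.std(dim=(2, 3), keepdim=True) + eps
    return sigma * (feat - mean) / std + mu


def optimize_style(feat, text_emb, project, steps=100, lr=1.0):
    """Optimize (mu, sigma) to maximize cosine similarity between the
    projected stylized features and the target text embedding."""
    # Initialize from the source statistics so optimization starts at identity.
    mu = feat.mean(dim=(2, 3), keepdim=True).detach().clone().requires_grad_(True)
    sigma = feat.std(dim=(2, 3), keepdim=True).detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([mu, sigma], lr=lr)
    for _ in range(steps):
        styled = stylize(feat, mu, sigma)
        emb = project(styled)                  # (B, D) CLIP-space embedding
        loss = 1 - F.cosine_similarity(emb, text_emb, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mu.detach(), sigma.detach()
```

In this sketch, only the channel-wise mean and standard deviation are optimized, which is one way to steer feature "style" towards the prompt while the spatial layout of the normalized features (the content) is left untouched; the resulting (mu, sigma) pairs can then be reused to augment source features when fine-tuning the downstream segmentation head.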