Though vision transformers (ViTs) have exhibited impressive representation-learning ability, we empirically find that they do not generalize well to unseen domains under previous domain generalization algorithms. In this paper, we propose DoPrompt, a novel approach based on prompt learning that embeds the knowledge of source domains in domain prompts for target domain prediction. Specifically, a domain prompt is prepended to the ViT input tokens of images from the corresponding source domain. Each domain prompt learns domain-specific knowledge efficiently, since it is optimized for only one domain. Meanwhile, we train a prompt adapter to produce a suitable prompt for each input image based on the learned source domain prompts. At test time, the adapted prompt generated by the prompt adapter exploits the similarity between the out-of-domain image's features and the source domains to properly integrate source domain knowledge. Extensive experiments are conducted on four benchmark datasets. Our approach achieves a 1.4% improvement in average accuracy, which is 3.5 times the improvement of the state-of-the-art algorithm with a ViT backbone.
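The two mechanisms described above, prompt prepending and similarity-weighted prompt adaptation, can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's implementation): function names are illustrative, vectors are plain Python lists, and we assume the adapter simply softmaxes dot-product similarities between the image feature and each learned domain prompt.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adapt_prompt(image_feature, domain_prompts):
    """Mix the learned source-domain prompts, weighted by the input
    image's similarity to each domain (a stand-in for the prompt adapter)."""
    weights = softmax([dot(image_feature, p) for p in domain_prompts])
    dim = len(domain_prompts[0])
    return [sum(w * p[i] for w, p in zip(weights, domain_prompts))
            for i in range(dim)]

def prepend_prompt(prompt, input_tokens):
    """Prepend the (adapted or per-domain) prompt token to the ViT
    input token sequence, as done for each source domain in training."""
    return [prompt] + input_tokens
```

In training, `prepend_prompt` would be called with the prompt of the image's own source domain; at test time, `adapt_prompt` produces a convex combination of the source prompts for the unseen-domain image.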