TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) aims to transfer the knowledge learnt from a labeled source domain to an unlabeled target domain. Previous work is mainly built upon convolutional neural networks (CNNs) to learn domain-invariant representations. Despite the recent surge in applying Vision Transformer (ViT) to vision tasks, the capability of ViT to adapt cross-domain knowledge remains unexplored in the literature. To fill this gap, this paper first comprehensively investigates the transferability of ViT on a variety of domain adaptation tasks. Surprisingly, ViT demonstrates superior transferability over its CNN-based counterparts by a large margin, and its performance can be further improved by incorporating adversarial adaptation. Nonetheless, directly using CNN-based adaptation strategies fails to take advantage of ViT's intrinsic merits (e.g., attention mechanism and sequential image representation), which play an important role in knowledge transfer. To remedy this, we propose a unified framework, namely Transferable Vision Transformer (TVT), to fully exploit the transferability of ViT for domain adaptation. Specifically, we devise a novel and effective unit, which we term Transferability Adaption Module (TAM). By injecting learned transferabilities into attention blocks, TAM compels ViT to focus on both transferable and discriminative features. In addition, we leverage discriminative clustering to enhance feature diversity and separation, which are undermined during adversarial domain alignment. To verify its versatility, we perform extensive studies of TVT on four benchmarks, and the experimental results demonstrate that TVT attains significant improvements compared to existing state-of-the-art UDA methods.
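The abstract does not spell out how TAM computes transferabilities or injects them into attention, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that each image patch has a domain-discriminator output `p` (probability of "source"), that a patch's transferability is the entropy of that prediction (a patch the discriminator cannot place is treated as more transferable), and that the class token's attention over patches is multiplied by these scores and renormalized. The function names, the entropy-based score, and the renormalization step are all hypothetical choices made for illustration.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy in bits of a Bernoulli(p) domain prediction; largest at p = 0.5."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transferability_weighted_attention(cls_scores, domain_probs):
    """Re-weight the class token's attention over patches by transferability.

    cls_scores:   raw attention logits from the class token to each patch.
    domain_probs: assumed per-patch discriminator outputs, P(patch is source).
    Both inputs are hypothetical stand-ins for quantities inside a ViT block.
    """
    attn = softmax(cls_scores)
    transfer = [binary_entropy(p) for p in domain_probs]
    weighted = [a * t for a, t in zip(attn, transfer)]
    total = sum(weighted)
    if total == 0.0:  # defensive fallback: keep plain attention
        return attn
    # Renormalizing is an assumption of this sketch, so weights still sum to 1.
    return [w / total for w in weighted]


if __name__ == "__main__":
    # Three patches with equal attention logits; the middle patch is clearly
    # domain-specific (p = 0.99), so its weight is pushed down.
    print(transferability_weighted_attention([1.0, 1.0, 1.0], [0.5, 0.99, 0.5]))
```

In this toy setting, patches whose domain is hard to predict keep most of the attention mass, which is one way to read "focus on both transferable and discriminative features": the original attention carries the discriminative signal, and the entropy score carries the transferable one.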
