In the recent past, several domain generalization (DG) methods have been proposed, showing encouraging performance; however, almost all of them are built on convolutional neural networks (CNNs). There is little to no progress on studying the DG performance of vision transformers (ViTs), which are challenging the supremacy of CNNs on standard benchmarks that are often built on the i.i.d. assumption. This renders the real-world deployment of ViTs doubtful. In this paper, we attempt to explore ViTs towards addressing the DG problem. Similar to CNNs, ViTs also struggle in out-of-distribution scenarios, and the main culprit is overfitting to source domains. Inspired by the modular architecture of ViTs, we propose a simple DG approach for ViTs, coined as self-distillation for ViTs. It reduces overfitting to source domains by easing the learning of the input-output mapping problem through curating non-zero entropy supervisory signals for intermediate transformer blocks. Further, it does not introduce any new parameters and can be seamlessly plugged into the modular composition of different ViTs. We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets. Moreover, we report favorable performance against recent state-of-the-art DG methods. Our code, along with pre-trained models, is publicly available at: https://github.com/maryam089/SDViT
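To make the idea concrete, the following is a minimal PyTorch-style sketch of self-distilling intermediate transformer blocks: one intermediate block's class token is passed through the shared classifier head and trained to match the soft (non-zero entropy) predictions of the final block, so no new parameters are introduced. The attribute names (`patch_embed`, `blocks`, `norm`, `head`, `cls_token`, `pos_embed`) follow timm-style ViT implementations and are assumptions for illustration; the exact loss formulation and block-selection strategy are described in the paper.

```python
# A minimal sketch (assumed module names, not the authors' exact code) of
# self-distillation for a ViT as described in the abstract.
import random
import torch
import torch.nn.functional as F


def sd_vit_loss(vit, images, labels, temperature=3.0, alpha=0.1):
    """Cross-entropy on the final block plus soft-label distillation into a
    randomly chosen intermediate block, reusing the same classifier head."""
    tokens = vit.patch_embed(images)                       # patch tokens
    cls = vit.cls_token.expand(tokens.size(0), -1, -1)     # prepend class token
    x = torch.cat([cls, tokens], dim=1) + vit.pos_embed

    feats = []
    for blk in vit.blocks:                                 # run all transformer blocks
        x = blk(x)
        feats.append(x)

    final_logits = vit.head(vit.norm(feats[-1])[:, 0])     # final prediction
    ce = F.cross_entropy(final_logits, labels)             # hard-label loss

    # Pick one intermediate block and distill the final soft predictions into it.
    inter = random.choice(feats[:-1])
    inter_logits = vit.head(vit.norm(inter)[:, 0])         # shared head: no new parameters
    kd = F.kl_div(
        F.log_softmax(inter_logits / temperature, dim=1),
        F.softmax(final_logits.detach() / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2

    return ce + alpha * kd
```

The non-zero entropy of the softened final-block predictions is what distinguishes this supervisory signal from the one-hot labels, which is the mechanism the abstract credits with reducing overfitting to source domains.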