Several domain generalization (DG) methods have recently been proposed with encouraging performance; however, almost all of them build on convolutional neural networks (CNNs). There has been little to no progress in studying the DG performance of vision transformers (ViTs), which are challenging the supremacy of CNNs on standard benchmarks that are often built on the i.i.d. assumption. This renders the real-world deployment of ViTs doubtful. In this paper, we attempt to explore ViTs towards addressing the DG problem. Similar to CNNs, ViTs also struggle in out-of-distribution scenarios, and the main culprit is overfitting to source domains. Inspired by the modular architecture of ViTs, we propose a simple DG approach for ViTs, coined self-distillation for ViTs. It reduces overfitting to source domains by easing the learning of the input-output mapping problem through curating non-zero entropy supervisory signals for intermediate transformer blocks. Further, it does not introduce any new parameters and can be seamlessly plugged into the modular composition of different ViTs. We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets. Moreover, we report favorable performance against recent state-of-the-art DG methods. Our code along with pre-trained models is publicly available at: https://github.com/maryam089/SDViT.
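To make the idea described above concrete, the following is a minimal sketch, not the authors' implementation, of how self-distillation to an intermediate transformer block could be wired up. It assumes a timm-style ViT exposing `patch_embed`, `blocks`, `norm`, and `head` attributes; the helper name `self_distillation_loss` and the hyperparameters `tau` and `alpha` are hypothetical choices for illustration.

```python
import random
import torch
import torch.nn.functional as F

def self_distillation_loss(vit, images, labels, tau=3.0, alpha=0.1):
    """Sketch of self-distillation for a ViT (hypothetical helper).

    Runs the full transformer to obtain the final logits, then re-uses the
    same classifier head on the [CLS] token of a randomly chosen
    intermediate block. The intermediate prediction is pulled towards the
    soft (non-zero entropy) final prediction, so earlier blocks receive a
    smoother supervisory signal than one-hot labels alone.
    """
    tokens = vit.patch_embed(images)              # assumed patch embedding
    feats = []
    for block in vit.blocks:                      # assumed list of transformer blocks
        tokens = block(tokens)
        feats.append(tokens)

    # Standard prediction from the last block's [CLS] token.
    final_logits = vit.head(vit.norm(feats[-1])[:, 0])

    # Pick one intermediate block (excluding the last) and classify its
    # [CLS] token with the same, shared head -- no new parameters.
    k = random.randrange(len(feats) - 1)
    inter_logits = vit.head(vit.norm(feats[k])[:, 0])

    ce = F.cross_entropy(final_logits, labels)
    kd = F.kl_div(
        F.log_softmax(inter_logits / tau, dim=-1),
        F.softmax(final_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return ce + alpha * kd
```

Because the intermediate block is supervised by softened final-block predictions rather than hard labels, the mapping each block must learn is easier, which is the mechanism the abstract credits with reducing overfitting to source domains.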