Recently, Vision Transformers (ViTs) have achieved impressive results on various vision tasks. Yet their generalization ability under different distribution shifts remains poorly understood. In this work, we provide a comprehensive study of the out-of-distribution generalization of ViTs. To support a systematic investigation, we first present a taxonomy of distribution shifts by categorizing them into five conceptual groups: corruption shift, background shift, texture shift, destruction shift, and style shift. We then perform extensive evaluations of ViT variants under each group of distribution shifts and compare their generalization ability with that of CNNs. Several important observations emerge: 1) ViTs generalize better than CNNs under multiple distribution shifts. With the same or fewer parameters, ViTs outperform corresponding CNNs by more than 5% in top-1 accuracy under most distribution shifts. 2) Larger ViTs gradually narrow the gap between in-distribution and out-of-distribution performance. To further improve the generalization of ViTs, we design Generalization-Enhanced ViTs by integrating adversarial learning, information theory, and self-supervised learning. By investigating three types of generalization-enhanced ViTs, we observe their gradient sensitivity and design a smoother learning strategy to achieve a stable training process. With the modified training schemes, we improve performance on out-of-distribution data by 4% over vanilla ViTs. We comprehensively compare the three generalization-enhanced ViTs with their corresponding CNNs and observe that: 1) For the enhanced models, larger ViTs still benefit more in out-of-distribution generalization. 2) Generalization-enhanced ViTs are more sensitive to hyper-parameters than corresponding CNNs. We hope our comprehensive study can shed light on the design of more generalizable learning architectures.