Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features and inductive biases with general-purpose neural architectures. Existing works empower these models with massive data, such as large-scale pre-training and/or repeated strong data augmentations, yet still report optimization-related problems (e.g., sensitivity to initialization and learning rates). Hence, this paper investigates ViTs and MLP-Mixers through the lens of loss geometry, aiming to improve the models' data efficiency at training and generalization at inference. Visualizations and Hessian analysis reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3\% and +11.0\% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with simple Inception-style preprocessing). We show that the improved smoothness is attributable to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pre-training or strong data augmentations, and they also possess more perceptive attention maps. Our model checkpoints are released at \url{https://github.com/google-research/vision_transformer}.
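The sharpness-aware optimizer referenced above follows the sharpness-aware minimization (SAM) formulation: rather than minimizing the training loss $L(w)$ directly, it minimizes the worst-case loss in an $\ell_2$ neighborhood of the weights, $\min_w \max_{\|\epsilon\|_2 \le \rho} L(w+\epsilon)$, and approximates the inner maximization with a single gradient-normalized step, $\hat{\epsilon}(w) = \rho\, \nabla_w L(w) / \|\nabla_w L(w)\|_2$. The snippet below is a minimal, illustrative sketch of one such gradient computation in JAX; it is not the released training code, and the names \texttt{loss\_fn}, \texttt{params}, \texttt{batch}, and \texttt{rho} are placeholder assumptions.
\begin{verbatim}
# Minimal SAM-style gradient computation in JAX (illustrative sketch only).
import jax
import jax.numpy as jnp

def sam_gradient(loss_fn, params, batch, rho=0.05):
    # Gradient at the current weights gives the ascent direction.
    grads = jax.grad(loss_fn)(params, batch)
    # Global l2 norm over the whole parameter pytree.
    norm = jnp.sqrt(sum(jnp.sum(g ** 2)
                        for g in jax.tree_util.tree_leaves(grads)))
    # Perturb the weights by rho along the normalized ascent direction.
    eps = jax.tree_util.tree_map(lambda g: rho * g / (norm + 1e-12), grads)
    perturbed = jax.tree_util.tree_map(lambda p, e: p + e, params, eps)
    # The gradient evaluated at the perturbed weights is the SAM gradient.
    return jax.grad(loss_fn)(perturbed, batch)
\end{verbatim}
In practice this doubles the forward-backward passes per step; the base optimizer (e.g., SGD or Adam) then applies the returned gradient to the unperturbed weights.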