Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical success in many vision tasks. Due to the non-convex interactions across layers, however, a theoretical analysis of learning and generalization remains mostly elusive. Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a shallow ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity required to achieve a zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error. We also prove that a training process using stochastic gradient descent (SGD) leads to a sparse attention map, which is a formal verification of the general intuition about the success of attention. Moreover, this paper indicates that a proper token sparsification can improve the test performance by removing label-irrelevant and/or noisy tokens, including spurious correlations. Empirical experiments on synthetic data and the CIFAR-10 dataset justify our theoretical results and generalize to deeper ViTs.
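The shallow architecture analyzed above, one self-attention layer followed by a two-layer perceptron, can be sketched in a few lines. This is a minimal NumPy illustration of that forward pass, not the paper's exact parameterization; all dimensions, the mean-pooling step, and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shallow_vit(X, params):
    """One self-attention layer followed by a two-layer perceptron.

    X: (num_tokens, d) token embeddings for a single input.
    Returns class logits and the (num_tokens, num_tokens) attention map,
    whose rows sum to one; the paper shows SGD training drives this map
    toward sparsity on label-relevant tokens.
    """
    WQ, WK, WV, W1, W2 = params
    d = X.shape[1]
    A = softmax((X @ WQ) @ (X @ WK).T / np.sqrt(d))  # attention map
    H = A @ (X @ WV)                                 # attended token features
    h = H.mean(axis=0)                               # pool over tokens (assumed)
    return np.maximum(h @ W1, 0.0) @ W2, A           # two-layer ReLU perceptron

# Illustrative sizes: 5 tokens, embedding dim 8, hidden width 16, 2 classes.
d, n_tokens, hidden, n_classes = 8, 5, 16, 2
params = (rng.standard_normal((d, d)), rng.standard_normal((d, d)),
          rng.standard_normal((d, d)), rng.standard_normal((d, hidden)),
          rng.standard_normal((hidden, n_classes)))
X = rng.standard_normal((n_tokens, d))
logits, attn = shallow_vit(X, params)
```

In this sketch, token sparsification as discussed above would correspond to dropping rows of `X` (label-irrelevant or noisy tokens) before the forward pass.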