We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability both theoretically and empirically for Transformer-based models. In contrast to previous practical tricks that address training instability, such as learning rate warmup, layer normalization, attention formulation, and weight initialization, we show that Lipschitz continuity is a more essential property for ensuring training stability. In LipsFormer, we replace unstable Transformer component modules with Lipschitz continuous counterparts: CenterNorm instead of LayerNorm, spectral initialization instead of Xavier initialization, scaled cosine similarity attention instead of dot-product attention, and a weighted residual shortcut. We prove that these introduced modules are Lipschitz continuous and derive an upper bound on the Lipschitz constant of LipsFormer. Our experiments show that LipsFormer enables stable training of deep Transformer architectures without the need for careful learning rate tuning such as warmup, yielding faster convergence and better generalization. As a result, on the ImageNet-1K dataset, LipsFormer-Swin-Tiny, based on Swin Transformer and trained for 300 epochs, obtains 82.7\% top-1 accuracy without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny, based on CSwin and trained for 300 epochs, achieves a top-1 accuracy of 83.5\% with 4.7G FLOPs and 24M parameters. The code will be released at \url{https://github.com/IDEA-Research/LipsFormer}.
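To make the component replacements above concrete, the following is a minimal mathematical sketch of two of them, CenterNorm and scaled cosine similarity attention; the exact formulations in the paper may differ, and in particular the $\frac{d}{d-1}$ compensation factor, the learnable temperature $\tau$, and the choice to normalize only queries and keys are assumptions of this sketch. CenterNorm keeps the mean centering of LayerNorm but drops the division by the standard deviation, while scaled cosine similarity attention replaces unbounded dot-product logits with bounded cosine similarities:
\begin{align}
\mathrm{CenterNorm}(\mathbf{x}) &= \boldsymbol{\gamma} \odot \frac{d}{d-1}\Big(\mathbf{x} - \frac{1}{d}\sum_{i=1}^{d} x_i \,\mathbf{1}\Big) + \boldsymbol{\beta},\\
\mathrm{Attn}(\mathbf{Q},\mathbf{K},\mathbf{V}) &= \mathrm{softmax}\!\big(\tau\, \hat{\mathbf{Q}}\hat{\mathbf{K}}^{\top}\big)\mathbf{V}, \qquad \hat{\mathbf{q}}_i = \frac{\mathbf{q}_i}{\lVert \mathbf{q}_i \rVert_2},\quad \hat{\mathbf{k}}_j = \frac{\mathbf{k}_j}{\lVert \mathbf{k}_j \rVert_2},
\end{align}
where $d$ is the channel dimension and $\boldsymbol{\gamma}, \boldsymbol{\beta}$ are learnable affine parameters as in LayerNorm.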