LipsFormer: Introducing Lipschitz Continuity to Vision Transformers - Zhuanzhi Papers


Lipschitz continuity · Lipschitz · Learning rate · Initialization · Learning rate warmup

April 19, 2023

LipsFormer: Introducing Lipschitz Continuity to Vision Transformers


Xianbiao Qi, Jianan Wang, Yihao Chen, Yukai Shi, Lei Zhang

From arXiv. To appear in ICLR 2023; our code will be public at https://github.com/IDEA-Research/LipsFormer

We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability both theoretically and empirically for Transformer-based models. In contrast to previous practical tricks that address training instability by learning rate warmup, layer normalization, attention formulation, and weight initialization, we show that Lipschitz continuity is a more essential property to ensure training stability. In LipsFormer, we replace unstable Transformer component modules with Lipschitz continuous counterparts: CenterNorm instead of LayerNorm, spectral initialization instead of Xavier initialization, scaled cosine similarity attention instead of dot-product attention, and a weighted residual shortcut. We prove that these introduced modules are Lipschitz continuous and derive an upper bound on the Lipschitz constant of LipsFormer. Our experiments show that LipsFormer allows stable training of deep Transformer architectures without the need for careful learning rate tuning such as warmup, yielding faster convergence and better generalization. As a result, on the ImageNet 1K dataset, LipsFormer-Swin-Tiny, based on Swin Transformer and trained for 300 epochs, obtains 82.7% without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny, based on CSwin and trained for 300 epochs, achieves a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters. The code will be released at https://github.com/IDEA-Research/LipsFormer.
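The four replacement modules named in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration and is not the authors' implementation: the centering-only form of `center_norm` with its D/(D-1) factor, the power-iteration rescaling in `spectral_scale`, the `tau` and `eps` parameters of `cosine_attention`, and the scalar `alpha` in `weighted_residual` are all hypothetical choices. The sketch uses plain Python lists and only the standard library.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(t * t for t in v))


def center_norm(x):
    """Centering-only normalization (assumed form): D/(D-1) * (x - mean(x)).

    Unlike LayerNorm there is no division by the standard deviation, so the
    map is linear and its L2 Lipschitz constant is at most D/(D-1).
    """
    d = len(x)
    mean = sum(x) / d
    return [d / (d - 1) * (t - mean) for t in x]


def spectral_norm(w, iters=100):
    """Estimate the largest singular value of matrix w by power iteration."""
    rows, cols = len(w), len(w[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = W v
        z = [sum(w[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # z = W^T u
        n = l2_norm(z)
        v = [t / n for t in z]
    return l2_norm([sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)])


def spectral_scale(w):
    """Rescale w so that its largest singular value is approximately 1."""
    s = spectral_norm(w)
    return [[t / s for t in row] for row in w]


def cosine_attention(q, keys, values, tau=1.0, eps=1e-6):
    """Single-query attention whose logits are tau * cosine similarity.

    Each logit lies in [-tau, tau], whereas a raw dot product is unbounded.
    """
    def unit(v):
        return [t / math.sqrt(sum(a * a for a in v) + eps) for t in v]

    qn = unit(q)
    logits = [tau * sum(a * b for a, b in zip(qn, unit(k))) for k in keys]
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - m) for s in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(wt * v[j] for wt, v in zip(weights, values)) for j in range(dim)]


def weighted_residual(x, f, alpha):
    """Residual connection x + alpha * f(x) with a scalar weight alpha."""
    return [a + alpha * b for a, b in zip(x, f(x))]


if __name__ == "__main__":
    x, y = [1.0, 2.0, 3.0, 4.0], [1.5, 1.0, 3.0, 5.0]
    diff_in = l2_norm([a - b for a, b in zip(x, y)])
    diff_out = l2_norm([a - b for a, b in zip(center_norm(x), center_norm(y))])
    print("CenterNorm output/input distance ratio:", diff_out / diff_in)
    print("Spectrally scaled matrix:", spectral_scale([[3.0, 0.0], [0.0, 1.0]]))
    print("Cosine attention output:",
          cosine_attention([1.0, 0.0], [[2.0, 0.0], [0.0, 5.0]],
                           [[1.0, 0.0], [0.0, 1.0]]))
```

In this sketch each piece has a bounded sensitivity to its input: the centering map and the rescaled matrix are linear with a known norm, the attention logits are confined to a fixed interval, and if `f` has Lipschitz constant `K`, the weighted residual has Lipschitz constant at most `1 + alpha * K`, so a small `alpha` keeps a deep stack of such blocks from amplifying perturbations quickly.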

