During pretraining, the Pre-LayerNorm transformer suffers from a gradient magnitude mismatch: gradients at early layers are much larger than those at later layers. This mismatch can be alleviated by our proposed NormFormer architecture, which adds three normalization operations to each layer: a LayerNorm after self-attention, head-wise scaling of self-attention outputs, and a LayerNorm after the first fully connected layer. The extra operations incur negligible compute cost (+0.4% parameter increase), but improve pretraining perplexity and downstream task performance for both causal and masked language models ranging from 125 million to 2.7 billion parameters. For example, adding NormFormer on top of our strongest 1.3B parameter baseline can reach equal perplexity 24% faster, or converge 0.27 perplexity better in the same compute budget. This model reaches GPT3-Large (1.3B) zero-shot performance 60% faster. For masked language modeling, NormFormer improves fine-tuned GLUE performance by 1.9% on average. Code to train NormFormer models is available in fairseq: https://github.com/pytorch/fairseq/tree/main/examples/normformer .
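To make the three added operations concrete, below is a minimal PyTorch sketch of how they might slot into a single Pre-LN layer. The module and argument names are illustrative, masking and dropout are omitted, and this is not the fairseq implementation linked above; it only shows where the extra LayerNorms and the per-head scale parameters sit relative to the baseline Pre-LN blocks.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class NormFormerLayerSketch(nn.Module):
    """Sketch of one Pre-LN transformer layer with the three NormFormer additions.
    Names and shapes are illustrative; see the fairseq example for the real code."""

    def __init__(self, d_model: int, n_heads: int, d_ffn: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        # Attention projections
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

        # Pre-LN norms already present in the baseline
        self.attn_ln_in = nn.LayerNorm(d_model)
        self.ffn_ln_in = nn.LayerNorm(d_model)

        # NormFormer additions
        self.head_scale = nn.Parameter(torch.ones(n_heads))  # (1) head-wise scaling
        self.attn_ln_out = nn.LayerNorm(d_model)             # (2) LayerNorm after self-attention
        self.ffn_ln_mid = nn.LayerNorm(d_ffn)                # (3) LayerNorm after the first FC

        self.fc1 = nn.Linear(d_model, d_ffn)
        self.fc2 = nn.Linear(d_ffn, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape

        # --- self-attention sub-block (Pre-LN) ---
        h = self.attn_ln_in(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)

        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        heads = attn @ v  # (b, n_heads, t, head_dim)

        # Head-wise scaling: one learned scalar per head, applied before the output projection.
        heads = heads * self.head_scale.view(1, self.n_heads, 1, 1)
        attn_out = self.out_proj(heads.transpose(1, 2).reshape(b, t, d))

        # Extra LayerNorm on the attention output before the residual add.
        x = x + self.attn_ln_out(attn_out)

        # --- feed-forward sub-block (Pre-LN) ---
        h = self.ffn_ln_in(x)
        h = self.ffn_ln_mid(F.gelu(self.fc1(h)))  # extra LayerNorm after FC1
        x = x + self.fc2(h)
        return x
```

In this sketch the only new parameters are the per-head scales and the affine parameters of the two extra LayerNorms, which is consistent with the roughly +0.4% parameter increase quoted above.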