重新思考变换器和ResNets中层正常化的跳过连接 (Rethinking Skip Connection with Layer Normalization in Transformers and ResNets) - 专知论文

会员服务 ·

0

跳跃连接 · 层规范化 · 层 · 缩放 · Performer ·

2021 年 5 月 15 日

Rethinking Skip Connection with Layer Normalization in Transformers and ResNets

翻译：重新思考变换器和ResNets中层正常化的跳过连接

Fenglin Liu,Xuancheng Ren,Zhiyuan Zhang,Xu Sun,Yuexian Zou

from arxiv, Accepted by COLING2020 (The 28th International Conference on Computational Linguistics (COLING 2020))

Skip connection, is a widely-used technique to improve the performance and the convergence of deep neural networks, which is believed to relieve the difficulty in optimization due to non-linearity by propagating a linear component through the neural network layers. However, from another point of view, it can also be seen as a modulating mechanism between the input and the output, with the input scaled by a pre-defined value one. In this work, we investigate how the scale factors in the effectiveness of the skip connection and reveal that a trivial adjustment of the scale will lead to spurious gradient exploding or vanishing in line with the deepness of the models, which could be addressed by normalization, in particular, layer normalization, which induces consistent improvements over the plain skip connection. Inspired by the findings, we further propose to adaptively adjust the scale of the input by recursively applying skip connection with layer normalization, which promotes the performance substantially and generalizes well across diverse tasks including both machine translation and image classification datasets.

翻译：跳过连接是一种广泛使用的改进深神经网络性能和趋同的技术,据认为,它通过通过神经网络层传播线性组件,通过神经网络层传播线性组件,可以减轻由于非线性而造成的优化困难。然而,从另一个角度看,它也可以被视为输入和输出之间的调制机制,输入以预先定义的值缩放为缩放。在这项工作中,我们调查跳过连接有效性的尺度因素如何,并揭示,根据模型的深度,微小的调整将导致虚假的梯度爆炸或消失,这可以通过正常化,特别是层的正常化来解决,这促使对直线性跳过连接的不断改进。根据调查结果,我们进一步建议通过反复跳过与层正常化的连接,调整投入的规模,从而极大地促进业绩,并广泛贯穿各种任务,包括机器翻译和图像分类数据集。

0

相关内容

跳跃连接

跳跃连接可以解决网络层数较深的情况下梯度消失的问题，同时有助于梯度的反向传播，加快训练过程。

“内卷“算子超越卷积、自注意力机制：CVPR2021强大的神经网络新算子involution

专知会员服务

28+阅读 · 2021年3月27日

【ST2020硬核课】深度神经网络，57页ppt

【ST2020硬核课】深度神经网络，57页ppt

专知会员服务

48+阅读 · 2020年8月19日

自然语言处理中的注意力机制，Attention in Natural Language Processing

自然语言处理中的注意力机制，Attention in Natural Language Processing

专知会员服务

136+阅读 · 2020年5月30日

【Google大脑】进化正则激活层，Evolving Normalization-Activation Layers

【Google大脑】进化正则激活层，Evolving Normalization-Activation Layers

专知会员服务

19+阅读 · 2020年4月9日

【ICLR2020】用实对二进制卷积训练二进制神经网络，Training Binary Neural Networks with Real-to-Binary Convolutions

【ICLR2020】用实对二进制卷积训练二进制神经网络，Training Binary Neural Networks with Real-to-Binary Convolutions

专知会员服务

26+阅读 · 2020年3月26日

【伯克利】再思考 Transformer中的Batch Normalization

【伯克利】再思考 Transformer中的Batch Normalization

专知会员服务

41+阅读 · 2020年3月21日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

深度卷积神经网络中的降采样

深度卷积神经网络中的降采样

极市平台

12+阅读 · 2019年5月24日

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Self-Attention GAN 中的 self-attention 机制

Self-Attention GAN 中的 self-attention 机制

PaperWeekly

12+阅读 · 2019年3月6日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

全球人工智能

20+阅读 · 2017年12月17日

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

机器学习研究会

20+阅读 · 2017年12月17日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

gan生成图像at 1024² 的代码论文

gan生成图像at 1024² 的代码论文

CreateAMind

4+阅读 · 2017年10月31日

Highway Networks For Sentence Classification

Highway Networks For Sentence Classification

哈工大SCIR

4+阅读 · 2017年9月30日

【学习】Hierarchical Softmax

【学习】Hierarchical Softmax

机器学习研究会

4+阅读 · 2017年8月6日

SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers

Arxiv

0+阅读 · 2021年7月7日

Rethinking Positional Encoding in Language Pre-training

Arxiv

4+阅读 · 2020年7月9日

On Layer Normalization in the Transformer Architecture

Arxiv

4+阅读 · 2020年2月12日

The Knowledge Within: Methods for Data-Free Model Compression

The Knowledge Within: Methods for Data-Free Model Compression

Arxiv

3+阅读 · 2019年12月3日

Extreme Language Model Compression with Optimal Subwords and Shared Projections

Extreme Language Model Compression with Optimal Subwords and Shared Projections

Arxiv

18+阅读 · 2019年9月25日

Convolutional Self-Attention Network

Arxiv

6+阅读 · 2019年4月8日

Fusing Recency into Neural Machine Translation with an Inter-Sentence Gate Model

Arxiv

3+阅读 · 2018年6月12日

Unsupervised Neural Machine Translation with Weight Sharing

Arxiv

6+阅读 · 2018年4月24日

Self-Attentive Residual Decoder for Neural Machine Translation

Arxiv

5+阅读 · 2018年3月22日

Language Modeling with Gated Convolutional Networks

Arxiv

5+阅读 · 2017年9月8日

VIP会员

文章信息

相关主题

相关VIP内容

“内卷“算子超越卷积、自注意力机制：CVPR2021强大的神经网络新算子involution

专知会员服务

28+阅读 · 2021年3月27日

【ST2020硬核课】深度神经网络，57页ppt

【ST2020硬核课】深度神经网络，57页ppt

专知会员服务

48+阅读 · 2020年8月19日

自然语言处理中的注意力机制，Attention in Natural Language Processing

自然语言处理中的注意力机制，Attention in Natural Language Processing

专知会员服务

136+阅读 · 2020年5月30日

【Google大脑】进化正则激活层，Evolving Normalization-Activation Layers

【Google大脑】进化正则激活层，Evolving Normalization-Activation Layers

专知会员服务

19+阅读 · 2020年4月9日

【ICLR2020】用实对二进制卷积训练二进制神经网络，Training Binary Neural Networks with Real-to-Binary Convolutions

【ICLR2020】用实对二进制卷积训练二进制神经网络，Training Binary Neural Networks with Real-to-Binary Convolutions

专知会员服务

26+阅读 · 2020年3月26日

【伯克利】再思考 Transformer中的Batch Normalization

【伯克利】再思考 Transformer中的Batch Normalization

专知会员服务

41+阅读 · 2020年3月21日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《复杂工程系统模型驱动设计决策支持系统：早期设计阶段挑战》最新138页

《日本陆上自卫队2040年作战方式与未来作战研究》最新23页slides

人工智能作为战争武器

《后勤保障》最新23页

相关资讯

深度卷积神经网络中的降采样

深度卷积神经网络中的降采样

极市平台

12+阅读 · 2019年5月24日

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Self-Attention GAN 中的 self-attention 机制

Self-Attention GAN 中的 self-attention 机制

PaperWeekly

12+阅读 · 2019年3月6日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

全球人工智能

20+阅读 · 2017年12月17日

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

机器学习研究会

20+阅读 · 2017年12月17日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

gan生成图像at 1024² 的代码论文

gan生成图像at 1024² 的代码论文

CreateAMind

4+阅读 · 2017年10月31日

Highway Networks For Sentence Classification

Highway Networks For Sentence Classification

哈工大SCIR

4+阅读 · 2017年9月30日

【学习】Hierarchical Softmax

【学习】Hierarchical Softmax

机器学习研究会

4+阅读 · 2017年8月6日

相关论文

SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers

Arxiv

0+阅读 · 2021年7月7日

Rethinking Positional Encoding in Language Pre-training

Arxiv

4+阅读 · 2020年7月9日

On Layer Normalization in the Transformer Architecture

Arxiv

4+阅读 · 2020年2月12日

The Knowledge Within: Methods for Data-Free Model Compression

The Knowledge Within: Methods for Data-Free Model Compression

Arxiv

3+阅读 · 2019年12月3日

Extreme Language Model Compression with Optimal Subwords and Shared Projections

Extreme Language Model Compression with Optimal Subwords and Shared Projections

Arxiv

18+阅读 · 2019年9月25日

Convolutional Self-Attention Network

Arxiv

6+阅读 · 2019年4月8日

Fusing Recency into Neural Machine Translation with an Inter-Sentence Gate Model

Arxiv

3+阅读 · 2018年6月12日

Unsupervised Neural Machine Translation with Weight Sharing

Arxiv

6+阅读 · 2018年4月24日

Self-Attentive Residual Decoder for Neural Machine Translation

Arxiv

5+阅读 · 2018年3月22日

Language Modeling with Gated Convolutional Networks

Arxiv

5+阅读 · 2017年9月8日

微信扫码咨询专知VIP会员