BERT中的非木材和商品化 (Of Non-Linearity and Commutativity in BERT) - 专知论文

会员服务 ·

0

BERT · 层 · 归纳偏好 · INTERACT · 注意力机制 ·

2021 年 1 月 12 日

Of Non-Linearity and Commutativity in BERT

翻译：BERT中的非木材和商品化

Sumu Zhao,Damian Pascual,Gino Brunner,Roger Wattenhofer

In this work we provide new insights into the transformer architecture, and in particular, its best-known variant, BERT. First, we propose a method to measure the degree of non-linearity of different elements of transformers. Next, we focus our investigation on the feed-forward networks (FFN) inside transformers, which contain 2/3 of the model parameters and have so far not received much attention. We find that FFNs are an inefficient yet important architectural element and that they cannot simply be replaced by attention blocks without a degradation in performance. Moreover, we study the interactions between layers in BERT and show that, while the layers exhibit some hierarchical structure, they extract features in a fuzzy manner. Our results suggest that BERT has an inductive bias towards layer commutativity, which we find is mainly due to the skip connections. This provides a justification for the strong performance of recurrent and weight-shared transformer models.

翻译：在这项工作中,我们对变压器结构,特别是其最著名的变压器结构,特别是其最著名的变压器,BERT提供了新的洞察力。首先,我们提出了一个衡量变压器不同成份非线性程度的方法。接下来,我们把调查的重点放在变压器内部的进料推进网络(FFN)上,这些变压器含有2/3的模型参数,迄今没有受到多少关注。我们发现,FFFFF是一个效率低下但重要的建筑元素,不能简单地被关注块所取代,而不出现性能的退化。此外,我们研究了BERT各层之间的相互作用,并表明,虽然这些层展示了某种等级结构,但以模糊的方式提取了特征。我们发现,BERT对层的顺位性存在一种感性偏向性偏向,这主要归因于跳跃式的连接。这为经常式和权重式变压器模型的强劲性能提供了理由。

0

相关内容

BERT

BERT全称Bidirectional Encoder Representations from Transformers，是预训练语言表示的方法，可以在大型文本语料库（如维基百科）上训练通用的“语言理解”模型，然后将该模型用于下游NLP任务，比如机器翻译、问答。

最新《Transformers模型》教程，64页ppt

最新《Transformers模型》教程，64页ppt

专知会员服务

321+阅读 · 2020年11月26日

Transformer模型-深度学习自然语言处理，17页ppt

Transformer模型-深度学习自然语言处理，17页ppt

专知会员服务

107+阅读 · 2020年8月30日

数据科学导论，54页ppt，Introduction to Data Science

数据科学导论，54页ppt，Introduction to Data Science

专知会员服务

42+阅读 · 2020年7月27日

BERT到底如何work的？A Primer in BERTology: What we know about how BERT works

BERT到底如何work的？A Primer in BERTology: What we know about how BERT works

专知会员服务

50+阅读 · 2020年2月28日

【新书：机器学习简介】《A Concise Introduction to Machine Learning》by A.C. Faul (CRC 2019)

【新书：机器学习简介】《A Concise Introduction to Machine Learning》by A.C. Faul (CRC 2019)

专知会员服务

77+阅读 · 2020年2月8日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

AINLP

30+阅读 · 2019年9月8日

ICLR2019最佳论文出炉

ICLR2019最佳论文出炉

专知

12+阅读 · 2019年5月6日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

Disentangled的假设的探讨

Disentangled的假设的探讨

CreateAMind

9+阅读 · 2018年12月10日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

条件GAN重大改进！cGANs with Projection Discriminator

条件GAN重大改进！cGANs with Projection Discriminator

CreateAMind

8+阅读 · 2018年2月7日

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

全球人工智能

20+阅读 · 2017年12月17日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

Dynamic Neural Networks: A Survey

Arxiv

37+阅读 · 2021年2月10日

A Survey on Visual Transformer

Arxiv

19+阅读 · 2020年12月23日

An Attentive Survey of Attention Models

An Attentive Survey of Attention Models

Arxiv

44+阅读 · 2020年12月15日

Interference and Generalization in Temporal Difference Learning

Arxiv

8+阅读 · 2020年3月13日

Visualizing and Measuring the Geometry of BERT

Visualizing and Measuring the Geometry of BERT

Arxiv

7+阅读 · 2019年10月28日

Revealing the Dark Secrets of BERT

Revealing the Dark Secrets of BERT

Arxiv

4+阅读 · 2019年9月11日

How Does BERT Answer Questions? A Layer-Wise Analysis of Transformer Representations

Arxiv

4+阅读 · 2019年9月11日

What Does BERT Look At? An Analysis of BERT's Attention

Arxiv

4+阅读 · 2019年6月11日

Discovery and recognition of motion primitives in human activities

Discovery and recognition of motion primitives in human activities

Arxiv

4+阅读 · 2019年2月4日

A Survey of the Recent Architectures of Deep Convolutional Neural Networks

A Survey of the Recent Architectures of Deep Convolutional Neural Networks

Arxiv

39+阅读 · 2019年1月17日

VIP会员

文章信息

相关主题

注意力机制

相关VIP内容

最新《Transformers模型》教程，64页ppt

最新《Transformers模型》教程，64页ppt

专知会员服务

321+阅读 · 2020年11月26日

Transformer模型-深度学习自然语言处理，17页ppt

Transformer模型-深度学习自然语言处理，17页ppt

专知会员服务

107+阅读 · 2020年8月30日

数据科学导论，54页ppt，Introduction to Data Science

数据科学导论，54页ppt，Introduction to Data Science

专知会员服务

42+阅读 · 2020年7月27日

BERT到底如何work的？A Primer in BERTology: What we know about how BERT works

BERT到底如何work的？A Primer in BERTology: What we know about how BERT works

专知会员服务

50+阅读 · 2020年2月28日

【新书：机器学习简介】《A Concise Introduction to Machine Learning》by A.C. Faul (CRC 2019)

【新书：机器学习简介】《A Concise Introduction to Machine Learning》by A.C. Faul (CRC 2019)

专知会员服务

77+阅读 · 2020年2月8日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《面向无人机集群的避障动态传感器覆盖算法》最新38页

【博士论文】推进数据高效的深度学习：非参数 Transformer、主动测试与上下文学习

《物联网（IoT）中的无人机通信高效控制》135页

《在GNSS信号降级环境中利用共识实现无人机集群稳健协调》

相关资讯

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

AINLP

30+阅读 · 2019年9月8日

ICLR2019最佳论文出炉

ICLR2019最佳论文出炉

专知

12+阅读 · 2019年5月6日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

Disentangled的假设的探讨

Disentangled的假设的探讨

CreateAMind

9+阅读 · 2018年12月10日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

条件GAN重大改进！cGANs with Projection Discriminator

条件GAN重大改进！cGANs with Projection Discriminator

CreateAMind

8+阅读 · 2018年2月7日

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

全球人工智能

20+阅读 · 2017年12月17日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

相关论文

Dynamic Neural Networks: A Survey

Arxiv

37+阅读 · 2021年2月10日

A Survey on Visual Transformer

Arxiv

19+阅读 · 2020年12月23日

An Attentive Survey of Attention Models

An Attentive Survey of Attention Models

Arxiv

44+阅读 · 2020年12月15日

Interference and Generalization in Temporal Difference Learning

Arxiv

8+阅读 · 2020年3月13日

Visualizing and Measuring the Geometry of BERT

Visualizing and Measuring the Geometry of BERT

Arxiv

7+阅读 · 2019年10月28日

Revealing the Dark Secrets of BERT

Revealing the Dark Secrets of BERT

Arxiv

4+阅读 · 2019年9月11日

How Does BERT Answer Questions? A Layer-Wise Analysis of Transformer Representations

Arxiv

4+阅读 · 2019年9月11日

What Does BERT Look At? An Analysis of BERT's Attention

Arxiv

4+阅读 · 2019年6月11日

Discovery and recognition of motion primitives in human activities

Discovery and recognition of motion primitives in human activities

Arxiv

4+阅读 · 2019年2月4日

A Survey of the Recent Architectures of Deep Convolutional Neural Networks

A Survey of the Recent Architectures of Deep Convolutional Neural Networks

Arxiv

39+阅读 · 2019年1月17日

微信扫码咨询专知VIP会员