In this work we provide new insights into the transformer architecture, and in particular, its best-known variant, BERT. First, we propose a method to measure the degree of non-linearity of different elements of transformers. Next, we focus our investigation on the feed-forward networks (FFN) inside transformers, which contain 2/3 of the model parameters and have so far not received much attention. We find that FFNs are an inefficient yet important architectural element and that they cannot simply be replaced by attention blocks without a degradation in performance. Moreover, we study the interactions between layers in BERT and show that, while the layers exhibit some hierarchical structure, they extract features in a fuzzy manner. Our results suggest that BERT has an inductive bias towards layer commutativity, which we find is mainly due to the skip connections. This provides a justification for the strong performance of recurrent and weight-shared transformer models.