The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude ($\ell_2$ norm) during training, and its implications for the emergent representations within self-attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. We prove that, as the parameters grow in magnitude, the network approximates a discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity, compared to the full network family, that can be described in terms of formal languages and automata. Our results suggest that saturation is a new characterization of an inductive bias implicit in GD that is of particular interest for NLP. We leverage the emergent discrete structure in a saturated transformer to analyze the role of different attention heads, finding that some heads focus locally on a small number of positions, while others compute global averages, which allows counting. We believe that understanding the interplay between these two capabilities may shed further light on the structure of computation within large transformers.
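The saturation effect described above can be illustrated with a small sketch. This is not the paper's code: it assumes that uniformly scaling the attention scores by a factor `scale` stands in for growth in parameter norm, and the score and value lists are hypothetical. Under that assumption, softmax attention approaches a uniform distribution over the highest-scoring positions, which yields either a local head (unique maximum) or a global average (all scores tied).

```python
import math

def softmax(scores, scale=1.0):
    """Softmax of `scale * scores`, computed stably by subtracting the max."""
    scaled = [scale * s for s in scores]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def saturated_attention(scores):
    """Limit of softmax(scale * scores) as scale -> infinity:
    uniform weight over the positions that attain the maximum score."""
    m = max(scores)
    winners = [1.0 if s == m else 0.0 for s in scores]
    k = sum(winners)
    return [w / k for w in winners]

# Hypothetical attention scores for one query over five positions.
local_scores = [0.2, 1.5, 0.3, -0.4, 0.1]  # unique maximum: attends to one position
tied_scores = [1.0, 1.0, 1.0, 1.0, 1.0]    # all tied: uniform (global average)

# As the scale grows, the soft weights approach the saturated weights.
for scale in (1, 10, 100):
    print(scale, [round(w, 3) for w in softmax(local_scores, scale)])
print("limit", saturated_attention(local_scores))

# A uniform head averaging a 0/1 indicator returns the fraction of marked
# positions, which is one way a global-average head can support counting.
indicator = [1.0, 0.0, 1.0, 1.0, 0.0]  # hypothetical value vector
weights = saturated_attention(tied_scores)
fraction = sum(w * v for w, v in zip(weights, indicator))
print("fraction of marked positions:", fraction)
```

The two score lists were chosen to mirror the two kinds of heads mentioned in the abstract. Real attention scores depend on the input and on learned query and key projections, so this shows only the limiting behavior of the softmax itself.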