We perform an effective-theory analysis of forward-backward signal propagation in wide and deep Transformers, i.e., residual neural networks with multi-head self-attention blocks and multilayer perceptron blocks. This analysis suggests particular width scalings of the initialization and training hyperparameters for these models. We then adopt these suggestions, training Vision and Language Transformers in practical setups.
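To make "width scalings of initialization and training hyperparameters" concrete, here is a minimal sketch assuming one familiar convention: weight initialization with standard deviation proportional to 1/sqrt(fan_in) and a learning rate scaled as 1/width. These particular scalings, and the names `init_width_scaled` and `base_lr`, are illustrative assumptions, not the prescriptions derived in the paper.

```python
# Illustrative width-scaled initialization and learning rate for an MLP block.
# Assumed conventions: init std ~ 1/sqrt(fan_in), learning rate ~ 1/width.
import math
import torch
import torch.nn as nn


def init_width_scaled(linear: nn.Linear) -> None:
    """Initialize a Linear layer with variance scaled by 1/fan_in (assumed convention)."""
    nn.init.normal_(linear.weight, mean=0.0, std=1.0 / math.sqrt(linear.in_features))
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)


width = 512  # model (residual-stream) width
mlp = nn.Sequential(
    nn.Linear(width, 4 * width),  # expansion layer of the MLP block
    nn.GELU(),
    nn.Linear(4 * width, width),  # projection back to the residual stream
)
for layer in mlp:
    if isinstance(layer, nn.Linear):
        init_width_scaled(layer)

# Width-dependent learning rate (again an illustrative convention, not the paper's result).
base_lr = 1e-1
optimizer = torch.optim.SGD(mlp.parameters(), lr=base_lr / width)
```

The point of such scalings is that forward and backward signals stay well-conditioned as the width grows, so hyperparameters tuned at a small width remain reasonable at a larger one.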