From the perspective of the layer normalization (LN) position, the architecture of Transformers can be categorized into two types: Post-LN and Pre-LN. Recent Transformers tend to adopt Pre-LN because training Post-LN with deep Transformers, e.g., ten or more layers, often becomes unstable, resulting in useless models. However, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers, e.g., six or fewer layers. This study first investigates the reason for these discrepant observations empirically and theoretically and discovers that (1) the LN in Post-LN is the source of the vanishing gradient problem that mainly leads to unstable training, whereas Pre-LN prevents it, and (2) Post-LN tends to preserve larger gradient norms in higher layers during back-propagation, which may lead to effective training. Exploiting these findings, we propose a method that provides both high stability and effective training through a simple modification of Post-LN. We conduct experiments on a wide range of text generation tasks and demonstrate that our method outperforms Pre-LN and enables stable training regardless of shallow or deep layer settings.
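To make the Post-LN/Pre-LN distinction concrete, the following is a minimal PyTorch sketch (not taken from the paper) of the two orderings: Post-LN normalizes after the residual addition, Pre-LN normalizes before the sublayer. The class names, the feed-forward sublayer, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN ordering: y = LN(x + Sublayer(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # LN is applied after the residual connection.
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Pre-LN ordering: y = x + Sublayer(LN(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # LN is applied before the sublayer; the residual path stays unnormalized.
        return x + self.sublayer(self.norm(x))

# Illustrative usage with a feed-forward sublayer (hypothetical sizes).
d_model = 512
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
x = torch.randn(8, 16, d_model)        # (batch, sequence, d_model)
y_post = PostLNBlock(d_model, ffn)(x)  # Post-LN block
y_pre = PreLNBlock(d_model, ffn)(x)    # Pre-LN block
```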