A big convergence of model architectures across language, vision, speech, and multimodal models is emerging. However, under the same name "Transformers", these areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of a Foundation Transformer for true general-purpose modeling, which serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill this goal. Specifically, we propose Sub-LayerNorm for good expressivity, together with an initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and training stability over the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).
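To make the Sub-LayerNorm idea concrete, below is a minimal sketch (not the official implementation) of a feed-forward sublayer that, in addition to the usual Pre-LN on the sublayer input, inserts a second LayerNorm right before the output projection. The module name `SubLNFeedForward`, the `gamma` argument, and the illustrative `gamma` value in the usage example are assumptions for exposition; the paper derives the exact per-architecture initialization scale from DeepNet-style theory, so consult it for the precise formulas.

```python
import math
import torch
import torch.nn as nn

class SubLNFeedForward(nn.Module):
    """Illustrative Sub-LayerNorm feed-forward sublayer (sketch, not official code).

    Structure: LN -> fc1 -> GELU -> LN -> fc2, with the residual connection
    kept outside the normalized path, i.e. a second LayerNorm is added just
    before the output projection on top of the usual Pre-LN.
    """

    def __init__(self, d_model: int, d_ffn: int, gamma: float = 1.0):
        super().__init__()
        self.ln_in = nn.LayerNorm(d_model)      # Pre-LN on the sublayer input
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.ln_inner = nn.LayerNorm(d_ffn)     # extra LN before the output projection
        self.fc2 = nn.Linear(d_ffn, d_model)
        self.act = nn.GELU()
        # Scale selected weights at initialization by gamma. The actual value of
        # gamma is given in the paper as a closed-form function of model depth
        # (derived from DeepNet); here it is just a constructor argument.
        nn.init.xavier_normal_(self.fc1.weight, gain=gamma)
        nn.init.xavier_normal_(self.fc2.weight, gain=gamma)
        nn.init.zeros_(self.fc1.bias)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fc2(self.ln_inner(self.act(self.fc1(self.ln_in(x)))))


# Usage example: a 12-layer encoder-style setting. The gamma below is purely
# illustrative; replace it with the value prescribed by the paper for your
# architecture (encoder-only, decoder-only, or encoder-decoder).
num_layers = 12
ffn = SubLNFeedForward(d_model=768, d_ffn=3072,
                       gamma=math.sqrt(math.log(2 * num_layers)))
out = ffn(torch.randn(2, 16, 768))  # (batch, sequence, d_model)
```

The same pattern applies to the attention sublayer: one LayerNorm on the input before the query/key/value projections and a second one before the attention output projection, with the residual connection left untouched.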