The increasing scale of model size and the continuous improvement of performance herald the arrival of the Big Model era. In this report, we explore what big model training is and how it works by diving into training objectives and training methodologies. Specifically, training objectives describe how to leverage web-scale data to develop extremely capable and incredibly large models based on self-supervised learning, while training methodologies, which build on distributed training, describe how to make big model training a reality. We summarize the existing training methodologies into three main categories: training parallelism, memory-saving technologies, and model sparsity design. Training parallelism can be further divided into data, pipeline, and tensor parallelism according to the dimension along which parallelization takes place. Memory-saving technologies are orthogonal and complementary to training parallelism, and model sparsity design further scales up the model size at a constant computational cost. A continuously updated paper list on big model training is available at https://github.com/qhliu26/BM-Training.
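To make the data-parallelism category concrete, the following is a minimal sketch (not taken from any system surveyed in this report) assuming a PyTorch setup with torch.distributed and DistributedDataParallel: each process holds a full replica of the model, trains on its own shard of the data, and gradients are synchronized across replicas by all-reduce. The toy model and random batch are purely illustrative.

# Hypothetical data-parallel training loop with PyTorch DDP (illustrative only).
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Each process owns one GPU and one shard of the data; gradients are
    # all-reduced across processes during every backward pass.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda(local_rank)   # stand-in for a large model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        # This process's local mini-batch (random data for illustration).
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()
        loss.backward()          # DDP overlaps gradient all-reduce with backward
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py

Pipeline and tensor parallelism instead split the model itself, across layers and within each layer's weight matrices respectively, so the three forms of parallelism can be combined along different dimensions.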