In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while GPU memory has only grown by 5x (16 GB to 80 GB). Therefore, the growth in model scale has been supported primarily through system innovations that allow large models to fit in the aggregate GPU memory of multiple GPUs. However, we are getting close to the GPU memory wall. It requires 800 NVIDIA V100 GPUs just to fit a trillion-parameter model for training, and such clusters are simply out of reach for most data scientists. In addition, training models at that scale requires complex combinations of parallelism techniques that put a big burden on data scientists to refactor their models. In this paper we present ZeRO-Infinity, a novel heterogeneous system technology that leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale on limited resources without requiring model code refactoring. At the same time, it achieves excellent training throughput and scalability, unencumbered by the limited CPU or NVMe bandwidth. ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current-generation GPU clusters. It can be used to fine-tune trillion-parameter models on a single NVIDIA DGX-2 node, making large models more accessible. In terms of training throughput and scalability, it sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak), while also demonstrating super-linear scalability. An open-source implementation of ZeRO-Infinity is available through DeepSpeed, a deep learning optimization library that makes distributed training easy, efficient, and effective.
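Since the abstract points to the open-source DeepSpeed implementation, the sketch below shows the kind of configuration that enables ZeRO-Infinity-style heterogeneous offloading there: ZeRO stage 3 with parameters and optimizer state spilled to NVMe. This is a minimal illustration, not a tuned setup; the `nvme_path` value is a placeholder, and exact key names and defaults should be checked against the DeepSpeed configuration documentation.

```python
# Sketch of a DeepSpeed config dict for ZeRO-Infinity-style offloading:
# ZeRO stage 3 partitions parameters, gradients, and optimizer state
# across GPUs, and the offload sections spill state to NVMe.
# "/local_nvme" is a placeholder path on the node's NVMe drive.
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # full partitioning of model and optimizer state
        "offload_param": {
            "device": "nvme",            # keep parameters on NVMe
            "nvme_path": "/local_nvme",  # placeholder NVMe mount point
            "pin_memory": True,          # pinned CPU buffers for faster transfer
        },
        "offload_optimizer": {
            "device": "nvme",            # keep optimizer state on NVMe
            "nvme_path": "/local_nvme",
            "pin_memory": True,
        },
    },
}

# In a training script, this dict would typically be passed to
# deepspeed.initialize(model=model, config=ds_config, ...).
print(sorted(ds_config["zero_optimization"]))
```

The point of the configuration-driven design is that the model code itself stays unchanged: where to place parameters and optimizer state (GPU, CPU, or NVMe) is a deployment decision expressed in the config, matching the abstract's claim of scale without model refactoring.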