Pre-trained models (PTMs) are revolutionizing Artificial Intelligence (AI) technology. However, the hardware requirements of PTM training are prohibitively high, making it accessible to only a small portion of the community. Therefore, we propose the PatrickStar system to lower the hardware requirements of PTMs and make them accessible to everyone. PatrickStar uses the CPU-GPU heterogeneous memory space to store the model data. Unlike existing works, we organize the model data in memory chunks and dynamically distribute them across the heterogeneous memory. Guided by runtime memory statistics collected in a warm-up iteration, chunks are orchestrated efficiently in heterogeneous memory, yielding lower CPU-GPU data transmission volume and higher bandwidth utilization. Working in symbiosis with the Zero Redundancy Optimizer, PatrickStar scales to multiple GPUs on multiple nodes. The system can train tasks with larger models and larger batch sizes than existing works can accomplish. Experimental results show that PatrickStar extends the trainable model scale to 2.27 and 2.5 times that of DeepSpeed, and consistently exhibits significantly higher execution speed. PatrickStar also successfully runs the 175B GPT3 training task on a 32-GPU cluster. Our code is publicly available at https://github.com/Tencent/PatrickStar.
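The abstract describes placing model data in chunks and moving them between CPU and GPU based on memory statistics gathered during a warm-up iteration. The following is a minimal, hypothetical Python sketch of that chunk-placement idea; the names (Chunk, place_chunks, gpu_free_bytes_at_step) are illustrative assumptions and do not reflect PatrickStar's actual API.

\begin{verbatim}
import torch

class Chunk:
    """A fixed-size block holding several parameters' payloads contiguously."""
    def __init__(self, size_bytes: int):
        self.size_bytes = size_bytes
        self.payload = torch.empty(size_bytes, dtype=torch.uint8, device="cpu")

    def move_to(self, device: str):
        # Moving a whole chunk keeps CPU-GPU traffic in large,
        # bandwidth-friendly transfers instead of many small copies.
        self.payload = self.payload.to(device, non_blocking=True)

def place_chunks(chunks, needed_now, gpu_free_bytes_at_step):
    """Decide chunk placement before a compute step (illustrative only).

    gpu_free_bytes_at_step would come from memory statistics collected
    during a warm-up iteration; chunks needed by the upcoming operators
    are pulled to GPU while others are evicted to CPU to respect the budget.
    """
    budget = gpu_free_bytes_at_step
    for chunk in chunks:
        if chunk in needed_now and chunk.size_bytes <= budget:
            chunk.move_to("cuda")
            budget -= chunk.size_bytes
        else:
            chunk.move_to("cpu")
\end{verbatim}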