The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. This fragmentation stems from the online GPU memory allocators used in popular deep learning frameworks such as PyTorch, which disregard tensor lifespans. The resulting inefficiency can waste as much as 43% of memory and trigger out-of-memory errors, undermining the effectiveness of these optimization methods. To address this, we introduce STAlloc, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting the spatial and temporal regularity in the memory allocation behavior of training workloads. STAlloc introduces a novel paradigm that combines offline planning with online allocation: offline planning leverages spatio-temporal regularities to generate a near-optimal allocation plan, while online allocation handles complex and dynamic models such as Mixture-of-Experts (MoE). Built as a pluggable PyTorch memory allocator, STAlloc reduces the fragmentation ratio by 85.1% on average (up to 100%) across both dense and MoE models with negligible overhead, enabling more memory-efficient, high-throughput training configurations and improving training throughput by up to 32.5%.
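As context for "pluggable PyTorch memory allocator", the sketch below shows how a custom allocator can be registered through PyTorch's public CUDAPluggableAllocator API (available since PyTorch 2.0). This is an illustrative assumption, not the paper's actual artifact: the library name `stalloc.so` and the entry-point names `stalloc_malloc`/`stalloc_free` are hypothetical.

```python
# Minimal sketch (assumed setup): routing PyTorch CUDA allocations through a
# custom allocator compiled into a shared library. The .so path and function
# names below are hypothetical placeholders, not the authors' released code.
import torch

allocator = torch.cuda.memory.CUDAPluggableAllocator(
    "./stalloc.so",     # shared library exposing the allocator entry points
    "stalloc_malloc",   # void* fn(ssize_t size, int device, cudaStream_t stream)
    "stalloc_free",     # void  fn(void* ptr, ssize_t size, int device, cudaStream_t stream)
)

# All subsequent CUDA tensor allocations go through the plugged-in allocator,
# replacing PyTorch's default caching allocator.
torch.cuda.memory.change_current_allocator(allocator)
```

Because the allocator is swapped in behind PyTorch's standard allocation interface, no changes to model or training code are required.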