GPU systems increasingly power modern datacenters at scale. Despite being highly performant, GPU systems suffer from performance variation at both the node and cluster levels. Such variation significantly impacts high-performance computing and artificial intelligence workloads alike, including cutting-edge large language models (LLMs). We analyze the performance of a single-node multi-GPU system running LLM training and observe that kernel-level performance variation is highly correlated with concurrent computation-communication (C3), a technique that overlaps computation and communication across GPUs for performance gains. We then take a further step and reason that thermally induced straggling, coupled with C3, drives this performance variation, an effect we coin Lit Silicon. Lit Silicon describes how, in a multi-GPU node, thermal imbalance across GPUs introduces node-level straggler GPUs, which in turn slow down the leader GPUs. Lit Silicon leads to node-level performance variation and inefficiency, impacting the entire datacenter from the bottom up. We propose analytical performance and power models for Lit Silicon to understand the potential system-level gains. We further design simple detection and mitigation techniques that effectively address the Lit Silicon problem, and evaluate three power management solutions: power optimization under the GPU thermal design power, performance optimization under node-level GPU power capping, and performance optimization under node-level CPU power sloshing. We conduct experiments on two workloads on two AMD Instinct™ MI300X GPU systems under two LLM training frameworks, and observe up to 6% performance and 4% power improvements, potentially saving hundreds of millions of dollars in datacenters. Our solution is almost a free lunch and can be adopted effortlessly in datacenters as a new node-level power management layer.
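As a concrete illustration of the kind of node-level detection and mitigation loop summarized above, the Python sketch below shows one control step that reads per-GPU effective clocks, identifies the thermally limited straggler, and shifts a small slice of a fixed node GPU power budget from the fastest leader toward it, in the spirit of the "performance optimization under node-level GPU power capping" case. This is a minimal sketch under our own assumptions: the helpers read_gpu_clock_mhz and set_gpu_power_cap_w, the GPU count, the budget, and the step size are hypothetical placeholders, not an API or parameters from the paper.

```python
from typing import List

NUM_GPUS = 8                  # GPUs per node (assumption)
NODE_GPU_BUDGET_W = 8 * 750.0 # fixed node-level GPU power budget in watts (assumption)


def read_gpu_clock_mhz(gpu_id: int) -> float:
    """Placeholder for vendor telemetry (e.g., SMI tooling).
    Simulates a thermally limited straggler on GPU 0 for demonstration only."""
    return 1900.0 if gpu_id == 0 else 2100.0


def set_gpu_power_cap_w(gpu_id: int, watts: float) -> None:
    """Placeholder for vendor power control; here we just report the new cap."""
    print(f"GPU {gpu_id}: power cap set to {watts:.0f} W")


def rebalance_power(current_caps_w: List[float], step_w: float = 10.0) -> List[float]:
    """One control step: detect the straggler (lowest effective clock) and move
    step_w watts to it from the fastest leader, keeping the node budget constant."""
    clocks = [read_gpu_clock_mhz(i) for i in range(NUM_GPUS)]
    straggler = min(range(NUM_GPUS), key=lambda i: clocks[i])
    leader = max(range(NUM_GPUS), key=lambda i: clocks[i])
    if straggler == leader:
        return current_caps_w  # clocks are balanced; nothing to do

    new_caps = list(current_caps_w)
    new_caps[leader] -= step_w     # the leader would only wait on the straggler,
    new_caps[straggler] += step_w  # so its surplus budget is better spent here
    assert abs(sum(new_caps) - NODE_GPU_BUDGET_W) < 1e-6  # budget is preserved

    for gpu_id, cap in enumerate(new_caps):
        set_gpu_power_cap_w(gpu_id, cap)
    return new_caps


if __name__ == "__main__":
    # Start from an even split of the node budget and apply one rebalancing step.
    caps = [NODE_GPU_BUDGET_W / NUM_GPUS] * NUM_GPUS
    caps = rebalance_power(caps)
```

In practice such a loop would run periodically and could equally implement the power-optimization variant (capping leaders down to the straggler's pace under the GPU thermal design power) rather than reallocating the budget; the sketch only fixes the control structure, not the policy.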