RAPID-LLM is a unified performance modeling framework for large language model (LLM) training and inference on GPU clusters. It couples a DeepFlow-based frontend that generates hardware-aware, operator-level Chakra execution traces from an abstract LLM specification (model shape, batch/sequence settings, training vs. inference, and hybrid-parallelism choices) with an extended Astra-Sim backend that executes those traces on explicit multi-dimensional network topologies with congestion-aware routing and support for degraded and faulty links. The frontend assigns per-operator latency using a tile-based model that accounts for SM under-utilization and multi-level memory traffic (SRAM/L2/HBM), and prunes memory-infeasible configurations using an activation-liveness traversal under recomputation, parallelism, and ZeRO/FSDP sharding policies. Across A100-based validation cases, RAPID-LLM predicts Llama inference step latency and GPT-scale training time per batch within 10.4\% of published measurements, and matches ns-3 packet-level results within 8\% on representative communication workloads. Case studies demonstrate how RAPID-LLM enables fast, exhaustive sweeps over hybrid-parallel configurations, quantifies sensitivity to soft link faults under realistic routing and congestion, and evaluates hypothetical GPU design variants, including HBM bandwidth-throttling effects.