Role-Based Fault Tolerance System for LLM RL Post-Training

RL post-training for LLMs has been widely scaled to enhance reasoning and tool-use capabilities. However, RL post-training interleaves training and inference workloads, exposing the system to faults from both sides. Existing fault tolerance frameworks for LLMs target either training or inference, leaving the optimization potential of asynchronous execution in RL unexplored. Our key insight is role-based fault isolation, so that a failure on one machine does not affect the others. We treat the trainer, rollout, and other management roles in RL training as distinct distributed sub-tasks. Instead of restarting the entire RL task as ByteRobust does, we recover only the failed role and reconnect it to the surviving ones, thereby eliminating full-restart overhead such as rollout replay and initialization delay. We present RobustRL, the first comprehensive robust system that handles GPU machine errors in RL post-training to improve the Effective Training Time Ratio (ETTR). RobustRL handles a fault in three steps. (1) \textit{Detect}. We implement role-aware monitoring that distinguishes actual failures from role-specific behaviors, avoiding both false positives and delayed detection. (2) \textit{Restart}. For trainers, we implement a non-disruptive recovery in which rollouts persist their state and continue trajectory generation while the trainer is rapidly restored via rollout warm standbys. For rollouts, we perform isolated machine replacement without interrupting the RL task. (3) \textit{Reconnect}. We replace static collective communication with dynamic point-to-point communication based on UCX (Unified Communication X), enabling immediate weight synchronization between recovered roles. In an RL training task on a 256-GPU cluster with a Qwen3-8B-Math workload and a 10\% failure injection frequency, RobustRL achieves an ETTR above 80\%, compared with 60\% for ByteRobust, and is 8.4\%-17.4\% faster in end-to-end training time.

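The contrast between a full restart and role-based recovery can be illustrated with a toy model. The sketch below is not RobustRL's implementation: the names `Role`, `RoleGroup`, `Job`, `recover_full_restart`, `recover_role_based`, and `ettr` are hypothetical, and the ETTR formula (productive training time divided by total wall-clock time) is an assumed definition that the abstract does not spell out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    # Roles named in the abstract; "manager" stands in for the
    # unspecified "other management roles".
    TRAINER = "trainer"
    ROLLOUT = "rollout"
    MANAGER = "manager"


@dataclass
class RoleGroup:
    role: Role
    alive: bool = True
    restarts: int = 0


@dataclass
class Job:
    groups: dict = field(
        default_factory=lambda: {r: RoleGroup(r) for r in Role}
    )

    def fail(self, role: Role) -> None:
        """Mark one role as failed (for example, a GPU machine error)."""
        self.groups[role].alive = False

    def recover_full_restart(self) -> list:
        """Baseline: any failure restarts every role in the job."""
        for group in self.groups.values():
            group.alive = True
            group.restarts += 1
        return [group.role for group in self.groups.values()]

    def recover_role_based(self) -> list:
        """Restart only the failed roles; healthy roles keep running."""
        restarted = []
        for group in self.groups.values():
            if not group.alive:
                group.alive = True
                group.restarts += 1
                restarted.append(group.role)
        return restarted


def ettr(productive_seconds: float, wall_clock_seconds: float) -> float:
    """Assumed definition: productive time over total wall-clock time."""
    return productive_seconds / wall_clock_seconds
```

In this model, a rollout failure followed by `recover_role_based()` restarts only the rollout group, while `recover_full_restart()` increments the restart counter of every role, which is the overhead that role-based isolation is meant to avoid.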