RL in the Wild: Characterizing RLVR Training in LLM Deployment

Large Language Models (LLMs) are now widely used across many domains. As they have developed, Reinforcement Learning with Verifiable Rewards (RLVR) has surged in recent months as a way to enhance their reasoning and understanding abilities. However, its complex data flows and diverse tasks pose substantial challenges to RL training systems, and RLVR remains poorly understood from a systems perspective. To understand the system challenges that RLVR introduces, we present a characterization study of RLVR tasks in our LLM deployment. Specifically, we investigate how workloads are distributed and how they vary across RL tasks and training steps. We identify issues such as GPU idling caused by skewed sequence length distributions, inefficient parallel strategies under dynamically varying workloads, inefficient data management mechanisms, and load imbalance. We describe our observations and call for further investigation into the remaining open challenges. Furthermore, we propose the PolyTrace benchmark suite for evaluation with realistic workloads, and a practical use case shows that the suite achieves 94.7% accuracy.

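One of the issues named in the abstract, GPU idling caused by skewed sequence lengths, can be illustrated with a toy model. The sketch below is not taken from the paper. It assumes a synchronous step in which every GPU must wait for the slowest one, a cost proportional to the number of tokens, and a naive round-robin assignment of sequences to GPUs. The function name `idle_fraction` and the example lengths are hypothetical.

```python
def idle_fraction(seq_lens, num_gpus):
    """Fraction of total GPU time spent idle in one synchronous step.

    Assumptions (illustrative only, not from the paper):
    - work per sequence is proportional to its token count;
    - sequences are assigned to GPUs round-robin;
    - the step ends only when the busiest GPU finishes.
    """
    per_gpu = [0] * num_gpus
    for i, n in enumerate(seq_lens):
        per_gpu[i % num_gpus] += n

    step_time = max(per_gpu)  # all GPUs wait for the slowest one
    if step_time == 0:
        return 0.0
    busy = sum(per_gpu)
    return 1.0 - busy / (step_time * num_gpus)


# Uniform lengths: every GPU finishes at the same time.
uniform = idle_fraction([100] * 8, num_gpus=4)

# One long-tail sequence: three GPUs mostly wait for the fourth.
skewed = idle_fraction([100] * 7 + [1500], num_gpus=4)

print(f"uniform idle fraction: {uniform:.3f}")
print(f"skewed idle fraction:  {skewed:.3f}")
```

Under these assumptions, a single long sequence in an otherwise uniform batch leaves most of the aggregate GPU time unused, which is the kind of effect a length-aware scheduler or a different parallel strategy would aim to reduce.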