RL has made groundbreaking advancements in robotics, datacenter management, and other applications. Unfortunately, system-level bottlenecks in RL workloads are poorly understood; we observe fundamental structural differences in RL workloads that make them inherently less GPU-bound than supervised learning (SL), including gathering training data in simulation, high-level code that frequently transitions to ML backends, and smaller neural networks. To explain where training time is spent in RL workloads, we propose RL-Scope, a cross-stack profiler that scopes low-level CPU/GPU resource usage to high-level algorithmic operations, and provides accurate insights by correcting for profiling overhead. We demonstrate RL-Scope's utility through in-depth case studies. First, we compare RL frameworks to quantify the effects of fundamental design choices behind ML backends. Next, we survey how training bottlenecks change as we consider different simulators and RL algorithms. Finally, we profile a scale-up workload and demonstrate that GPU utilization metrics reported by commonly-used tools dramatically inflate GPU usage, whereas RL-Scope reports true GPU-bound time. RL-Scope is an open-source tool available at https://github.com/UofT-EcoSystem/rlscope .