With deep reinforcement learning (RL) methods achieving results that exceed human capabilities in games, robotics, and simulated environments, continued scaling of RL training is crucial to its deployment in solving complex real-world problems. However, improving the performance scalability and power efficiency of RL training by understanding the architectural implications of CPU-GPU systems remains an open problem. In this work we investigate and improve the performance and power efficiency of distributed RL training on CPU-GPU systems by approaching the problem not solely from the GPU microarchitecture perspective but through a holistic, system-level analysis. We quantify the overall hardware utilization of a state-of-the-art distributed RL training framework and empirically identify the bottlenecks caused by GPU microarchitectural, algorithmic, and system-level design choices. We show that the GPU microarchitecture itself is well balanced for state-of-the-art RL frameworks, but further investigation reveals that the number of actors running the environment interactions and the amount of hardware resources available to them are the primary limiters of performance and power efficiency. To this end, we introduce a new system design metric, the CPU/GPU ratio, and show how to find the optimal balance between CPU and GPU resources when designing scalable and efficient CPU-GPU systems for RL training.
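As a rough illustration of what a CPU/GPU ratio sweep might look like, the toy model below picks the number of actor CPU cores per GPU that maximizes training steps per watt. It is only a sketch: the per-core actor throughput, learner throughput, and power figures are hypothetical placeholders, not measurements or methodology from this work.

```python
# Hypothetical sketch: sweeping CPU cores per GPU to locate a balanced CPU/GPU ratio.
# All throughput and power numbers are illustrative placeholders.

def system_throughput(cpu_cores, actor_steps_per_core=2_000, learner_steps_per_gpu=60_000):
    """End-to-end steps/s is limited by the slower of the actor (CPU) and learner (GPU) sides."""
    actor_rate = cpu_cores * actor_steps_per_core  # environment steps produced per second
    learner_rate = learner_steps_per_gpu           # steps one GPU can consume per second
    return min(actor_rate, learner_rate)

def system_power(cpu_cores, watts_per_core=10, gpu_watts=300):
    """Simple additive power model: per-core CPU power plus one GPU."""
    return cpu_cores * watts_per_core + gpu_watts

# Choose the CPU/GPU ratio (cores per GPU) with the best steps-per-watt under this toy model.
best = max(range(1, 129), key=lambda cores: system_throughput(cores) / system_power(cores))
print(f"Most power-efficient CPU/GPU ratio in this toy model: {best} cores per GPU")
```

In this toy model, throughput stops improving once the actors saturate the learner, so adding further CPU cores only adds power; the steps-per-watt optimum sits near that crossover point, which is the kind of balance the CPU/GPU ratio metric is meant to capture.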