Deep reinforcement learning (RL) is a powerful framework for training decision-making models in complex environments. However, RL can be slow because it requires repeated interaction with a simulation of the environment. In particular, there are key systems-engineering bottlenecks when using RL in complex environments that feature multiple agents with high-dimensional state, observation, or action spaces. We present WarpDrive, a flexible, lightweight, and easy-to-use open-source RL framework that implements end-to-end deep multi-agent RL on a single GPU (Graphics Processing Unit), built on PyCUDA and PyTorch. Using the extreme parallelization capability of GPUs, WarpDrive enables orders-of-magnitude faster RL compared to common implementations that blend CPU simulations and GPU models. Our design runs simulations, and the agents within each simulation, in parallel. It eliminates data copying between CPU and GPU, and it uses a single simulation data store on the GPU that is safely updated in place. WarpDrive provides a lightweight Python interface and flexible environment wrappers that are easy to use and extend. Together, this allows the user to easily run thousands of concurrent multi-agent simulations and train on extremely large batches of experience. Through extensive experiments, we verify that WarpDrive provides high throughput and scales almost linearly with the number of agents and parallel environments. For example, WarpDrive yields 2.9 million environment steps/second with 2000 environments and 1000 agents (at least 100x higher throughput than a CPU implementation) in a benchmark Tag simulation. As such, WarpDrive is a fast and extensible multi-agent RL platform that significantly accelerates research and development.
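The parallel design sketched in the abstract (all environments and all agents stepped together against a single data store that is updated in place) can be illustrated in plain Python. This is a conceptual sketch only, not WarpDrive's actual API: the names below are illustrative, and in WarpDrive the loop nest is replaced by CUDA kernels with one thread per agent, so the data store lives in GPU memory and no CPU-GPU copies occur.

```python
# Conceptual sketch (illustrative names, not WarpDrive's API):
# step many environments in lockstep against one shared data
# store that is mutated in place, as the abstract describes.

NUM_ENVS = 4      # thousands of parallel environments in practice
NUM_AGENTS = 3    # per-environment agent count

# Single shared data store: positions[env][agent], updated in place.
positions = [[0] * NUM_AGENTS for _ in range(NUM_ENVS)]

def step_all(actions):
    """Advance every agent in every environment in one pass.

    On a GPU this loop nest maps to one thread per (env, agent)
    pair, so all environments and agents advance concurrently
    without copying data between host and device.
    """
    for e in range(NUM_ENVS):
        for a in range(NUM_AGENTS):
            positions[e][a] += actions[e][a]  # in-place update

# Example: every agent takes a +1 action for 10 steps.
actions = [[1] * NUM_AGENTS for _ in range(NUM_ENVS)]
for _ in range(10):
    step_all(actions)
```

Because there is only one data store and every step mutates it in place, no per-step allocation or host/device transfer is needed; the experience batch is simply the sequence of in-place states read directly from GPU memory by the PyTorch training loop.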