Monte Carlo Tree Search (MCTS) methods have achieved great success in many Artificial Intelligence (AI) benchmarks. However, in-tree operations become a critical performance bottleneck when parallelizing MCTS on CPUs. In this work, we develop a scalable CPU-FPGA system for Tree-Parallel MCTS. We propose a novel decomposition and mapping of the MCTS data structure and computation onto the CPU and FPGA to reduce communication and coordination. Our system achieves high scalability by encapsulating in-tree operations in an SRAM-based FPGA accelerator. To lower the high data access latency and inter-worker synchronization overheads, we develop several hardware optimizations. Using our accelerator, we obtain up to $35\times$ speedup for in-tree operations and $3\times$ higher overall system throughput. Our CPU-FPGA system also scales better with the number of parallel workers than state-of-the-art parallel MCTS implementations on CPUs.
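For readers unfamiliar with the term, the following is a minimal software sketch of what in-tree operations in Tree-Parallel MCTS commonly look like: selection, expansion, and backup on a shared tree. It is not the accelerator design described above. The UCT scoring rule, the virtual-loss mechanism for steering concurrent workers apart, the reward range of [0, 1], and all names (`Node`, `uct_score`, `select`, `expand`, `backup`) are assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)
class Node:
    """One node of the shared search tree (hypothetical layout)."""
    children: Dict[int, "Node"] = field(default_factory=dict)
    visits: int = 0          # completed simulations through this node
    value_sum: float = 0.0   # sum of rewards, assumed to lie in [0, 1]
    virtual_loss: int = 0    # workers currently in flight below this node


def uct_score(parent: Node, child: Node, c: float = 1.4) -> float:
    """UCT score in which each in-flight worker counts as a visit with reward 0."""
    n = child.visits + child.virtual_loss
    if n == 0:
        return math.inf  # unvisited children are tried first
    parent_n = parent.visits + parent.virtual_loss
    mean = child.value_sum / n
    return mean + c * math.sqrt(math.log(parent_n + 1) / n)


def select(root: Node) -> List[Node]:
    """Descend to a leaf, marking the path with virtual loss."""
    node = root
    node.virtual_loss += 1
    path = [node]
    while node.children:
        parent = node
        node = max(parent.children.values(),
                   key=lambda ch: uct_score(parent, ch))
        node.virtual_loss += 1
        path.append(node)
    return path


def expand(leaf: Node, num_actions: int) -> None:
    """Attach one child per action to a leaf node."""
    for action in range(num_actions):
        leaf.children.setdefault(action, Node())


def backup(path: List[Node], reward: float) -> None:
    """Propagate a simulation result and release the virtual loss."""
    for node in path:
        node.virtual_loss -= 1
        node.visits += 1
        node.value_sum += reward
```

In this sketch, two workers that call `select` on the same root before either calls `backup` end up at different leaves, because the first worker's virtual loss lowers the score of the child it chose. In a real multi-threaded implementation, each of these updates would need synchronization on the shared tree, which is the kind of inter-worker overhead the abstract refers to.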