To control large-scale distributed systems online, model predictive control (MPC) must swiftly solve the underlying high-dimensional optimization problem. The literature offers multiple techniques to accelerate this solve, mainly through software-based algorithmic advancements and hardware-assisted computation. However, those methods focus on arithmetic acceleration and overlook the benefits of the underlying system's structure. In particular, existing decoupled software-hardware algorithm designs, which naively parallelize arithmetic operations on the hardware, do not tackle hardware overheads such as CPU-GPU and thread-to-thread communication in a principled manner. The advantages of parallelizable subproblem decomposition in distributed MPC are also not well recognized and exploited. As a result, hardware acceleration for MPC has not reached its full potential. In this paper, we explore these opportunities by leveraging GPUs to parallelize the distributed and localized MPC (DLMPC) algorithm. We exploit the locality constraints embedded in the DLMPC formulation to reduce hardware-intrinsic communication overheads. Our parallel implementation achieves up to 50x faster runtime than its CPU counterparts under various parameters. Moreover, we find that locality-aware GPU parallelization can halve the optimization runtime compared to naive acceleration. Overall, our results demonstrate the performance gains of software-hardware co-design that keeps the information exchange structure in mind.