We present ExaTron, an implementation of a trust-region Newton algorithm for bound-constrained nonlinear programming problems that runs entirely on multiple GPUs. Because no data is transferred between CPU and GPU, our implementation eliminates a major performance bottleneck in memory-bound situations, particularly when solving many small problems in batch. We discuss the design principles and implementation details of our kernel function and core operations, and justify the design choices with numerical experiments. Using distributed control of alternating current optimal power flow as an application, where a large problem is decomposed into many smaller nonlinear programs using a Lagrangian approach, we demonstrate the computational performance of ExaTron on the Summit supercomputer at Oak Ridge National Laboratory. Our numerical results show linear scaling with respect to the batch size and the number of GPUs, and a speedup of more than 35 times on 6 GPUs over the 40 CPUs available on a single node.
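To make the batch setting concrete, the following is a minimal sketch of solving many small, independent bound-constrained problems in one call. It is not ExaTron's trust-region Newton method: for brevity it swaps in plain projected gradient descent on separable quadratics, and the names `project`, `solve_box_qp`, and `solve_batch` are hypothetical. On a GPU, the loop over problems in `solve_batch` is the part that would run in parallel.

```python
from typing import List, Tuple

Vector = List[float]
# One problem: (diagonal of Hessian, linear term, lower bounds, upper bounds)
Problem = Tuple[Vector, Vector, Vector, Vector]


def project(x: Vector, lower: Vector, upper: Vector) -> Vector:
    """Clamp each coordinate of x into [lower_i, upper_i]."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def solve_box_qp(q_diag: Vector, c: Vector, lower: Vector, upper: Vector,
                 steps: int = 200) -> Vector:
    """Minimize 0.5 * sum(q_i * x_i^2) + sum(c_i * x_i) subject to
    lower <= x <= upper, using projected gradient descent.

    Assumes every q_i > 0 (strictly convex, diagonal Hessian).
    """
    step = 1.0 / max(q_diag)  # 1/L, where L is the largest Hessian entry
    x = project([0.0] * len(c), lower, upper)  # feasible starting point
    for _ in range(steps):
        grad = [qi * xi + ci for qi, xi, ci in zip(q_diag, x, c)]
        trial = [xi - step * gi for xi, gi in zip(x, grad)]
        x = project(trial, lower, upper)  # step, then return to the box
    return x


def solve_batch(problems: List[Problem]) -> List[Vector]:
    """Solve each independent problem; no problem depends on another,
    so this loop is the natural unit of parallelism."""
    return [solve_box_qp(*p) for p in problems]
```

Because the problems share no data, the batch can grow without changing the per-problem work, which is the property behind the scaling with batch size reported above.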