Maintaining computational load balance is important to the performance of codes that operate under a distributed computing model. This is especially true for GPU architectures, which can suffer from memory oversubscription if improperly load balanced. We present enhancements to traditional load balancing approaches that explicitly target GPU architectures, and we explore the resulting performance. A key component of our enhancements is the introduction of several GPU-amenable strategies for assessing compute work. These strategies are implemented and benchmarked to find the best data collection methodology for in-situ assessment of GPU compute work. For the fully kinetic particle-in-cell code WarpX, which supports MPI+CUDA parallelism, we investigate the performance of the improved dynamic load balancing via a strong scaling-based performance model. We show that, for a laser-ion acceleration test problem run with up to 6144 GPUs on Summit, the enhanced dynamic load balancing achieves 62%--74% (88% when running on 6 GPUs) of the theoretically predicted maximum speedup. For the 96-GPU case, we find that dynamic load balancing improves performance relative to baselines without load balancing (3.8x speedup) and with static load balancing (1.2x speedup). Our results provide important insights into dynamic load balancing and performance assessment, and are particularly relevant in the context of distributed memory applications run on GPUs.
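To make the notion of load balance concrete, the following is a minimal Python sketch and not the implementation used in WarpX. It assumes that each unit of work (for example, a grid box) already has a measured cost, assigns units to ranks with a generic greedy heuristic (largest cost first, to the least-loaded rank), and reports a load-balance efficiency defined here as mean rank load divided by maximum rank load. The function names `greedy_balance` and `efficiency`, and this particular efficiency definition, are illustrative assumptions.

```python
import heapq


def greedy_balance(costs, n_ranks):
    """Assign work items to ranks: largest cost first, to the least-loaded rank.

    Hypothetical helper for illustration; returns a list mapping item index -> rank.
    """
    # Min-heap of (current load, rank id) so the least-loaded rank pops first.
    heap = [(0.0, rank) for rank in range(n_ranks)]
    heapq.heapify(heap)
    assignment = [None] * len(costs)
    # Visit items in order of decreasing cost.
    for i in sorted(range(len(costs)), key=lambda k: costs[k], reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[i] = rank
        heapq.heappush(heap, (load + costs[i], rank))
    return assignment


def efficiency(costs, assignment, n_ranks):
    """Load-balance efficiency (assumed definition): mean rank load / max rank load."""
    loads = [0.0] * n_ranks
    for cost, rank in zip(costs, assignment):
        loads[rank] += cost
    return (sum(loads) / n_ranks) / max(loads)
```

As a usage example, with `costs = [8, 1, 1, 1, 1, 2, 2]` on two ranks, a round-robin assignment (`i % 2`) yields rank loads of 12 and 4, an efficiency of about 0.67, whereas `greedy_balance` yields loads of 8 and 8, an efficiency of 1.0. In a dynamic scheme, the costs would be re-measured in situ and the assignment recomputed periodically.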