Understanding the performance bottlenecks that arise from the intricate hardware-software interactions of highly parallel programs on HPC clusters is crucial. This paper sheds light on how automatic asynchronous MPI communication can be facilitated in memory-bound parallel programs on multicore clusters. For instance, deliberately injecting delays to slow down MPI processes can improve performance if certain conditions are met. This leads to the counter-intuitive conclusion that noise, independent of its source, is not always detrimental but can be leveraged for performance improvements. We employ phase-space graphs as a new tool for visualizing parallel program dynamics. They are useful for spotting patterns in parallel execution that easily go unnoticed with traditional tracing tools. We investigate five microbenchmarks and applications on different supercomputer platforms: an MPI-augmented STREAM Triad, two implementations of Lattice-Boltzmann fluid solvers, and the LULESH and HPCG proxy applications.