The performance of highly parallel applications on distributed-memory systems is influenced by many factors. Analytic performance modeling techniques aim to provide insight into performance limitations and are often the starting point of optimization efforts. However, coupling analytic models across the system hierarchy (socket, node, network) fails to capture the intricate interplay between the program code and the hardware, especially when execution and communication bottlenecks are involved. In this paper, we investigate the effect of "bottleneck evasion" and how it can lead to automatic overlap of communication overhead with computation. Bottleneck evasion leads to a gradual loss of the initial bulk-synchronous behavior of a parallel code, so that its processes become desynchronized. This occurs most prominently in memory-bound programs. We therefore choose memory-bound benchmark and application codes to demonstrate the consequences of desynchronization on two different supercomputing platforms: an MPI-augmented STREAM Triad, sparse matrix-vector multiplication, and a collective-avoiding Chebyshev filter diagonalization code. We investigate the role of idle waves as possible triggers for desynchronization and show the impact of automatic asynchronous communication for a spectrum of code properties and parameters, such as saturation point, matrix structure, domain decomposition, and communication concurrency. Our findings reveal how eliminating synchronization points (such as collective communication or barriers) yields performance improvements that go beyond what can be expected by simply subtracting the overhead of the collective from the overall runtime.
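For readers unfamiliar with the STREAM Triad kernel named above, the following is a minimal single-process sketch in Python. It only illustrates the arithmetic of the kernel, `a[i] = b[i] + s * c[i]`; it is not the benchmark code used in the paper. The function names `triad` and `run_sweeps`, the array initial values, and the comment marking where communication might sit in an MPI-augmented variant are assumptions made for illustration.

```python
def triad(a, b, c, s):
    """One STREAM Triad sweep: a[i] = b[i] + s * c[i] for every index i."""
    for i in range(len(a)):
        a[i] = b[i] + s * c[i]


def run_sweeps(n, sweeps, s=3.0):
    """Run several Triad sweeps over local arrays of length n (illustrative values)."""
    a = [0.0] * n
    b = [1.0] * n
    c = [2.0] * n
    for _ in range(sweeps):
        triad(a, b, c, s)
        # Assumption: an MPI-augmented variant would exchange data with
        # other ranks at this point, between sweeps. No communication is
        # performed in this single-process sketch.
    return a


if __name__ == "__main__":
    result = run_sweeps(n=8, sweeps=3)
    print(result[0])
```

Each iteration reads two double-precision values and writes one while performing only two floating-point operations, which is why the kernel is a common stand-in for memory-bound code.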