The collective operations are considered critical for improving the performance of exascale-ready and high-performance computing applications. On this paper we focus on the Message-Passing Interface (MPI) Allgather many to many collective, which is amongst the most called and time-consuming operations. Each MPI algorithm for this call suffers from different operational and performance limitations, that might include only working for restricted cases, requiring linear amounts of communication steps with the growth in number of processes, memory copies and shifts to assure correct data organization, and non-local data exchange patterns, most of which negatively contribute to the total operation time. All these characteristics create an environment where there is no algorithm which is the best for all cases and this consequently implies that careful choices of alternatives must be made to execute the call. Considering such aspects, we propose the Stripe Parallel Binomial Trees (Sparbit) algorithm, which has optimal latency and bandwidth time costs with no usage restrictions. It also maintains a much more local communication pattern that minimizes the delays due to long range exchanges, allowing the extraction of more performance from current systems when compared with asymptotically equivalent alternatives. On its best scenario, Sparbit surpassed the traditional MPI algorithms on 46.43% of test cases with mean (median) improvements of 34.7% (26.16%) and highest reaching 84.16%.
翻译:集体操作被认为对于提高快速准备和高性能计算应用程序的性能至关重要。 在本文中,我们侧重于“信息发送界面”(MPI) 将许多组合到许多集体,这是最需要和最费时的操作。 每一个MPI的算法都存在不同的操作和性能限制, 可能包括只为有限的案例而工作, 需要线性通信步骤, 以确保正确的数据组织、 记忆副本和移动, 以及非本地数据交换模式, 其中大部分对总操作时间有负面影响。 所有这些特征都创造了一个环境, 没有最适合所有案例的算法, 因而意味着必须谨慎地选择替代方法来执行号召。 考虑到这些方面, 我们提议“ 双双双双双双弦树( Sparbit) 算法, 它具有最佳的耐用率和带宽时间成本, 而没有使用限制。 它还保持了更远得多的本地通信模式, 最大限度地减少由于远程交换而产生的延误, 使得在与等值替代品相比, 能够从当前系统中提取更多功能。