We discuss a simple, binary tree-based algorithm for the collective allreduce (reduction-to-all, MPI_Allreduce) operation for parallel systems consisting of $p$ suitably interconnected processors. The algorithm can be doubly pipelined to exploit the bidirectional (telephone-like) communication capabilities of the communication system. To make the algorithm more symmetric, the processors are organized into two rooted trees with communication between the two roots. For each pipeline block, each non-leaf processor takes three communication steps, consisting of receiving from and sending to the two children, and sending to and receiving from the root. In a round-based, uniform, linear-cost communication model in which simultaneously sending and receiving $n$ data elements takes time $\alpha+\beta n$ for system-dependent constants $\alpha$ (communication start-up latency) and $\beta$ (time per element), the time for the allreduce operation on vectors of $m$ elements is $O(\log p+\sqrt{m\log p})+3\beta m$ with a suitable choice of the pipeline block size. We compare the performance of an implementation in MPI to similar reduce-followed-by-broadcast algorithms and to the native MPI_Allreduce collective on a modern, small $36\times 32$ processor cluster. With a proper choice of the number of pipeline blocks, it is possible to achieve better performance than pipelined algorithms that do not exploit bidirectional communication.
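The role of the pipeline block size in the stated bound can be illustrated with a small sketch of the cost model. It assumes, purely for illustration, that an allreduce over $b$ blocks takes $c\log_2 p + 3b$ rounds, each costing $\alpha+\beta m/b$; the constant `c`, the function names, and the values of `alpha` and `beta` below are hypothetical and not taken from the paper. Under that assumption the modelled time is $c\alpha\log_2 p + 3\alpha b + c\beta m\log_2 p/b + 3\beta m$, which is minimized near $b^\ast=\sqrt{c\beta m\log_2 p/(3\alpha)}$ and then has the form $O(\log p+\sqrt{m\log p})+3\beta m$.

```python
import math


def allreduce_time(m, p, b, alpha, beta, c=4.0):
    """Modelled time for m elements on p processors using b pipeline blocks.

    Assumption (not from the source): the pipeline needs c*log2(p) + 3*b
    rounds, and each round moves one block of m/b elements at cost
    alpha + beta*(m/b).
    """
    rounds = c * math.log2(p) + 3 * b
    return rounds * (alpha + beta * m / b)


def best_block_count(m, p, alpha, beta, c=4.0):
    """Integer block count in [1, m] that minimizes allreduce_time.

    The model is convex in b, so the integer optimum is the floor or the
    ceiling of the real-valued minimizer, clamped to the valid range.
    """
    b_star = math.sqrt(c * math.log2(p) * beta * m / (3 * alpha))
    candidates = {
        max(1, min(m, math.floor(b_star))),
        max(1, min(m, math.ceil(b_star))),
    }
    return min(candidates, key=lambda b: allreduce_time(m, p, b, alpha, beta, c))


if __name__ == "__main__":
    # Hypothetical constants: 1 microsecond latency, 1 nanosecond per element.
    m, p, alpha, beta = 10**6, 36 * 32, 1e-6, 1e-9
    b = best_block_count(m, p, alpha, beta)
    print("blocks:", b, "modelled time:", allreduce_time(m, p, b, alpha, beta))
    print("unpipelined (b=1):", allreduce_time(m, p, 1, alpha, beta))
```

In practice the constants would have to be measured on the target system, and the best block count found experimentally may differ from what such a simple model suggests.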