We give a faster, communication-free, parallel construction of optimal communication schedules for broadcasting $n$ distinct blocks of data from a root processor to all other processors in $1$-ported, $p$-processor networks with fully bidirectional communication. For any $p$ and $n$, broadcasting in this model requires $n-1+\lceil\log_2 p\rceil$ communication rounds. In contrast to other constructions, all processors follow the same circulant-graph communication pattern, which makes it possible to use the schedules for the allgather (all-to-all broadcast) operation as well. The new construction takes $O(\log^3 p)$ time steps per processor, and each processor can compute its part of the schedule independently of the other processors in $O(\log p)$ space. This is a significant improvement, of considerable practical import, over the sequential $O(p \log^2 p)$ time and $O(p\log p)$ space construction of Tr\"aff and Ripke (2009). The round-optimal schedule construction is then used to implement communication-optimal algorithms for the broadcast and (irregular) allgather collective operations as found in MPI (the \emph{Message-Passing Interface}); for certain problem ranges, these implementations improve significantly in practice over those in standard MPI libraries (\texttt{mpich}, OpenMPI, Intel MPI). The application to the irregular allgather operation is entirely new.
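To make the round bound concrete, the following minimal Python sketch evaluates $n-1+\lceil\log_2 p\rceil$ for a few values of $n$ and $p$. The function names are hypothetical and chosen only for illustration, and the block-by-block binomial-tree baseline is an assumption added for comparison; it is not one of the constructions discussed here. The sketch computes only round counts, not the schedules themselves.

```python
def ceil_log2(p: int) -> int:
    """Return ceil(log2(p)) for p >= 1 using exact integer arithmetic."""
    if p < 1:
        raise ValueError("p must be at least 1")
    # For p >= 1, the bit length of p - 1 equals ceil(log2(p)).
    return (p - 1).bit_length()


def optimal_broadcast_rounds(n: int, p: int) -> int:
    """Round count n - 1 + ceil(log2(p)) for broadcasting n blocks to p processors."""
    if n < 1 or p < 2:
        raise ValueError("need n >= 1 blocks and p >= 2 processors")
    return n - 1 + ceil_log2(p)


def blockwise_tree_rounds(n: int, p: int) -> int:
    """Illustrative baseline (an assumption, not from the text): broadcast each
    block separately over a binomial tree, one block after another."""
    if n < 1 or p < 2:
        raise ValueError("need n >= 1 blocks and p >= 2 processors")
    return n * ceil_log2(p)


if __name__ == "__main__":
    for n, p in [(1, 8), (10, 16), (10, 17), (1000, 1000)]:
        print(f"n={n:5d} p={p:5d} "
              f"optimal={optimal_broadcast_rounds(n, p):6d} "
              f"blockwise={blockwise_tree_rounds(n, p):6d}")
```

For a single block ($n=1$) the two counts coincide at $\lceil\log_2 p\rceil$; the gap grows with $n$, which is where a round-optimal schedule matters.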