State-of-the-art parallel sorting algorithms for distributed-memory architectures are based on computing a balanced partitioning via sampling and histogramming. By finding samples that partition the sorted keys into evenly sized chunks, these algorithms minimize the number of communication rounds required. Histogramming (computing the positions of the samples) guides sampling, which reduces the overall number of samples collected. We derive lower and upper bounds on the number of sampling/histogramming rounds required to compute a balanced partitioning. Improving on prior results, we show that when using $p$ processors/parts, $O(\log^* p)$ rounds with $O(p/\log^* p)$ samples per round suffice. We match this with a lower bound showing that any algorithm using $O(p)$ samples per round requires at least $\Omega(\log^* p)$ rounds. Additionally, we prove an $\Omega(p \log p)$ lower bound on the number of samples for a single round, which shows that sample sort is optimal in this case. To derive the lower bound, we propose a hard randomized input distribution and apply classical results from the distribution theory of runs.
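To make the sampling/histogramming loop concrete, the following is a minimal single-process Python sketch of iterative splitter refinement. It is not the algorithm analyzed in this work and does not attain the $O(\log^* p)$ round bound; it only illustrates how histogramming narrows the region that later rounds sample from. The names `histogram` and `find_splitters`, the tolerance parameter `eps`, and the default of $p$ samples per round are all assumptions made for illustration, and keys are assumed to be distinct.

```python
import bisect
import random


def histogram(probes, chunks):
    """Global rank of each probe: number of keys <= probe, summed over all chunks.

    Each chunk is a locally sorted list, standing in for one processor's keys.
    """
    return [sum(bisect.bisect_right(c, s) for c in chunks) for s in probes]


def find_splitters(chunks, eps=0.05, samples_per_round=None, seed=0):
    """Illustrative sampling/histogramming rounds (assumes distinct keys).

    Returns (splitters, rounds), where splitters[i] has a global rank within
    eps * n / (2 * p) of the ideal rank (i + 1) * n / p.
    """
    rng = random.Random(seed)
    p = len(chunks)
    n = sum(len(c) for c in chunks)
    samples_per_round = samples_per_round or p  # assumption: p samples per round
    targets = [i * n // p for i in range(1, p)]
    tol = eps * n / (2 * p)

    # For every unresolved target rank, keep the tightest known bracketing keys.
    lo = [float("-inf")] * (p - 1)
    hi = [float("inf")] * (p - 1)
    splitters = [None] * (p - 1)
    open_targets = list(range(p - 1))
    rounds = 0

    while open_targets:
        rounds += 1
        # Sampling: only keys inside some unresolved interval are candidates.
        pool = set()
        for c in chunks:
            for i in open_targets:
                a = bisect.bisect_right(c, lo[i])
                b = bisect.bisect_left(c, hi[i])
                pool.update(c[a:b])
        candidates = sorted(pool)
        probes = rng.sample(candidates, min(samples_per_round, len(candidates)))

        # Histogramming: compute the global rank of every probe.
        ranks = histogram(probes, chunks)

        # Use the ranks to accept splitters or shrink the intervals.
        for s, r in zip(probes, ranks):
            for i in open_targets:
                if abs(r - targets[i]) <= tol:
                    splitters[i] = s
                elif r < targets[i]:
                    lo[i] = max(lo[i], s)
                else:
                    hi[i] = min(hi[i], s)
        open_targets = [i for i in open_targets if splitters[i] is None]

    return splitters, rounds
```

In this sketch every probe either becomes a splitter or tightens the interval it was drawn from, so the candidate pool shrinks in each round and the loop terminates once every target rank has a splitter within the tolerance. The resulting parts then hold at most roughly $(1+\epsilon)\,n/p$ keys each.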