State-of-the-art parallel sorting algorithms for distributed-memory architectures are based on computing a balanced partitioning via sampling and histogramming. By finding samples that partition the sorted keys into evenly sized chunks, these algorithms minimize the number of communication rounds required. Histogramming (computing the positions of samples in the global sorted order) guides sampling, which reduces the overall number of samples collected. We derive lower and upper bounds on the number of sampling/histogramming rounds required to compute a balanced partitioning. Improving on prior results, we show that when using $p$ processors, $O(\log^* p)$ rounds with $O(p/\log^* p)$ samples per round suffice. We match this with a lower bound showing that any algorithm with $O(p)$ samples per round requires $\Omega(\log^* p)$ rounds. Additionally, we prove an $\Omega(p \log p)$ lower bound on the number of samples for a single round, thus showing that the existing one-round algorithms (sample sort, AMS sort, and HSS) have optimal sample size complexity. To derive the lower bound, we propose a hard randomized input distribution and apply classical results from the distribution theory of runs.
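To make the sampling and histogramming terminology concrete, the following is a minimal single-process sketch of one sampling/histogramming round. It is an illustration under stated assumptions, not the algorithm analyzed here: the $p$ processors are simulated as locally sorted Python lists, each processor contributes a fixed number of uniform samples, and the function names (`histogram`, `one_round_splitters`, `bucket_sizes`) are hypothetical.

```python
import bisect
import random


def histogram(local_chunks, probes):
    """Global rank of each probe: the number of keys strictly smaller than it,
    summed over all (locally sorted) processor chunks."""
    return [sum(bisect.bisect_left(chunk, q) for chunk in local_chunks)
            for q in probes]


def one_round_splitters(local_chunks, samples_per_proc, rng):
    """One sampling/histogramming round (illustrative assumption, not the
    paper's scheme): sample keys, histogram them, then pick for each ideal
    rank i*n/p the sampled key whose global rank is closest to it."""
    p = len(local_chunks)
    n = sum(len(chunk) for chunk in local_chunks)
    # Sampling step: every simulated processor contributes uniform samples.
    probes = sorted(key
                    for chunk in local_chunks
                    for key in rng.sample(chunk, samples_per_proc))
    # Histogramming step: compute the global position of every sample.
    ranks = histogram(local_chunks, probes)
    splitters = []
    for i in range(1, p):
        target = i * n // p
        best = min(range(len(probes)), key=lambda t: abs(ranks[t] - target))
        splitters.append(probes[best])
    return splitters


def bucket_sizes(local_chunks, splitters):
    """Number of keys that would be routed to each of the p buckets."""
    n = sum(len(chunk) for chunk in local_chunks)
    bounds = [0] + histogram(local_chunks, splitters) + [n]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    rng = random.Random(0)
    p, keys_per_proc = 8, 1000
    keys = rng.sample(range(10**6), p * keys_per_proc)  # distinct keys
    chunks = [sorted(keys[i::p]) for i in range(p)]
    splitters = one_round_splitters(chunks, samples_per_proc=32, rng=rng)
    print(bucket_sizes(chunks, splitters))
```

In this sketch, increasing `samples_per_proc` tightens the balance of the bucket sizes at the cost of a larger sample. A multi-round scheme would instead reuse the histogram to restrict later samples to the rank intervals that still lack a good splitter.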