高性能和灵活的并行算法：Semisort和相关问题 (High-Performance and Flexible Parallel Algorithms for Semisort and Related Problems)

Semisort is a fundamental algorithmic primitive widely used in the design and analysis of efficient parallel algorithms. It takes input as an array of records and a function extracting a \emph{key} per record, and reorders them so that records with equal keys are contiguous. Since many applications only require collecting equal values, but not fully sorting the input, semisort is broadly applicable, e.g., in string algorithms, graph analytics, and geometry processing, among many other domains. However, despite dozens of recent papers that use semisort in their theoretical analysis and the existence of an asymptotically optimal parallel semisort algorithm, most implementations of these parallel algorithms choose to implement semisort by using comparison or integer sorting in practice, due to potential performance issues in existing semisort implementations. In this paper, we revisit the semisort problem, with the goal of achieving a high-performance parallel semisort implementation with a flexible interface. Our approach can easily extend to two related problems, \emph{histogram} and \emph{collect-reduce}. Our algorithms achieve strong speedups in practice, and importantly, outperform state-of-the-art parallel sorting and semisorting methods for almost all settings we tested, with varying input sizes, distribution, and key types. We also test two important applications with real-world data, and show that our algorithms improve the performance over existing approaches. We believe that many other parallel algorithm implementations can be accelerated using our results.

翻译：Semisort是一种广泛用于设计和分析高效并行算法的基本算法原语。它以记录的数组和提取每个记录“键”的函数作为输入，并重新排序，使具有相等键的记录连续。由于许多应用程序仅需要收集相等的值，而不是完全排序输入，因此semisort具有广泛的适用性，例如字符串算法，图形分析和几何处理等许多领域。然而，尽管近年来有许多使用semisort进行理论分析的论文，并且存在一种渐近最优的并行semisort算法，但由于现有semisort实现中的潜在性能问题，大多数并行算法的实现实际上选择使用比较或整数排序来实现semisort。在本文中，我们重新审视semisort问题，旨在实现高性能的并行semisort实现和灵活的接口。我们的方法可以轻松扩展到两个相关问题：histogram和collect-reduce。我们的算法在实践中实现了强大的加速，并且重要的是，在我们测试的几乎所有设置中，包括不同的输入大小、分布和键类型，都优于最先进的并行排序和semisort方法。我们还使用实际数据测试了两个重要应用程序，并展示了我们的算法比现有方法提高了性能。我们相信，许多其他的并行算法实现可以使用我们的结果加速。