Recent works showed that implementations of Quicksort using vector CPU instructions can outperform the non-vectorized algorithms in widespread use. However, these implementations are typically single-threaded, implemented for a particular instruction set, and restricted to a small set of key types. We lift these three restrictions: our proposed 'vqsort' algorithm integrates into the state-of-the-art parallel sorter 'ips4o', with a geometric mean speedup of 1.59. The same implementation works on seven instruction sets (including SVE and RISC-V V) across four platforms. It also supports floating-point and 16-128 bit integer keys. To the best of our knowledge, this is the fastest sort for non-tuple keys on CPUs, up to 20 times as fast as the sorting algorithms implemented in standard libraries. This paper focuses on the practical engineering aspects enabling the speed and portability, which we have not yet seen demonstrated for a Quicksort implementation. Furthermore, we introduce compact and transpose-free sorting networks for in-register sorting of small arrays, and a vector-friendly pivot sampling strategy that is robust against adversarial input.
翻译:最近的工作显示,使用矢量 CPU 的Quicksort 使用矢量 CPU 指令的实施可以超越广泛使用的非矢量算法。 但是, 这些执行通常都是单向值, 用于特定的指令集, 并限于一小组关键类型 。 我们取消了这三项限制: 我们提议的“ vqsort ” 算法将“ ips4o” 整合到最先进的平行排序器“ ips4o” 中, 其速度为1.59。 四个平台的七个指令组( 包括 SVE 和 RISC- V V) 的同一执行工作。 它还支持浮动点和 16- 128 位整形键 。 根据我们的知识, 这是CPUs 上非列键的最快类型, 最多为标准库中执行排序算法的20倍。 本文侧重于能够速度和可移动性的实际工程方面, 我们尚未看到用于快速执行的7个指令组( 包括 SVEVE 和 RISC-V V) 。 此外, 我们引入了不易位化和不设位的网络的网络网络网络, 以对抗对等式战略进行稳的试样的矩阵。