Top-k selection, which identifies the largest or smallest k elements from a data set, is a fundamental operation in data-intensive domains such as databases and deep learning, so its scalability and efficiency are critical for these high-performance systems. However, previous studies on its efficient GPU implementation are mostly merge-based and rely heavily on the fast but size-limited on-chip memory, thereby limiting the scalability with a restricted upper bound on k. This work introduces a scalable and optimized GPU-parallel radix top-k selection that supports significantly larger k values than existing methods without compromising efficiency, regardless of input length and batch size. Our method incorporates a novel optimization framework tailored for high memory bandwidth and resource utilization, achieving up to 2.5x speedup over the prior art for non-batch queries and up to 4.8x speedup for batch queries. In addition, we propose an adaptive scaling technique that strengthens the robustness, which further provides up to 2.7x speedup on highly adversarial input distributions.
翻译:暂无翻译