SPHINCS+ is a stateless hash-based signature scheme that provides strong post-quantum security, but its signature generation is slow because of intensive hash computation. GPUs offer massive parallelism that can accelerate SPHINCS+ signing. However, existing GPU-based optimizations either fail to fully exploit the inherent parallelism of SPHINCS+'s Merkle tree structure or lack fine-grained, compiler-level customization across its diverse computational kernels. This paper proposes HERO Sign, a GPU-accelerated SPHINCS+ implementation that adopts hierarchical tuning and efficient compile-time optimizations. HERO Sign reexamines the parallelization opportunities enabled by data independence across SPHINCS+ components, including FORS, MSS, and WOTS+. It introduces a Tree Fusion strategy for FORS, which contains a large number of independent branches. The fusion strategy is guided by an automated Tree Tuning search algorithm that adapts fusion schemes to different GPU architectures. To further improve performance, HERO Sign employs an adaptive compilation strategy that accounts for the varying effectiveness of compiler optimizations across SPHINCS+ kernels such as FORS Sign, TREE Sign, and WOTS+ Sign. During compilation, the strategy automatically selects between PTX and native code paths to maximize efficiency. For batched signature generation, HERO Sign optimizes kernel-level overlapping using a task-graph-based construction to reduce multi-stream idle time and kernel launch overhead. Experimental results show that, compared to state-of-the-art GPU implementations, HERO Sign achieves throughput improvements of 1.28-3.13×, 1.28-2.92×, and 1.24-2.60× under the SPHINCS+ 128f, 192f, and 256f parameter sets, respectively, on RTX 4090. Similar gains are observed on A100, H100, and GTX 2080, along with a two-orders-of-magnitude reduction in kernel launch latency.
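The data independence that motivates fusing FORS trees can be illustrated with a small CPU-side sketch. This is not HERO Sign's implementation: the functions `H`, `merkle_root`, and `fused_roots` are hypothetical names, plain SHA-256 stands in for the SPHINCS+ tweakable hash, and all trees are assumed to have the same power-of-two number of leaves. The sketch only shows that sibling pairs within a level, and across trees, can be gathered into one flat batch of hash calls, which is the kind of work list a GPU kernel could process in parallel.

```python
import hashlib


def H(left: bytes, right: bytes) -> bytes:
    # Assumption: plain SHA-256 replaces the SPHINCS+ tweakable hash (no address or tweak).
    return hashlib.sha256(left + right).digest()


def merkle_root(leaves):
    """Reduce a single tree level by level; nodes within a level are independent."""
    level = list(leaves)
    while len(level) > 1:
        level = [H(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def fused_roots(trees):
    """Reduce several equal-height trees together, one flat batch of hashes per level."""
    levels = [list(t) for t in trees]
    while len(levels[0]) > 1:
        # Gather every sibling pair of every tree into one work list.
        pairs = [(t, i) for t in range(len(levels))
                 for i in range(0, len(levels[t]), 2)]
        # Each entry is independent, so this loop is what would run in parallel.
        out = [H(levels[t][i], levels[t][i + 1]) for t, i in pairs]
        width = len(levels[0]) // 2
        # Split the flat result back into one level per tree.
        levels = [out[t * width:(t + 1) * width] for t in range(len(levels))]
    return [lv[0] for lv in levels]


if __name__ == "__main__":
    # Four toy trees with eight leaves each (sizes chosen only for illustration).
    trees = [[hashlib.sha256(bytes([t, i])).digest() for i in range(8)]
             for t in range(4)]
    assert fused_roots(trees) == [merkle_root(t) for t in trees]
    print("fused and per-tree roots match")
```

The fused version produces the same roots as reducing each tree separately, but exposes a work list that is several times wider at every level. How many trees to fuse per batch is the kind of architecture-dependent choice the Tree Tuning search is described as making.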