In this work, we present a well-optimized GPU implementation of Dilithium, one of the digital signature algorithms in the NIST post-quantum standards. We focus on warp-level design and apply several strategies to improve performance, including memory pooling, kernel fusion, batching, and streaming. Together, these optimizations yield an efficient, high-throughput solution. We profile on both desktop and server-grade GPUs and achieve up to 57.7$\times$, 93.0$\times$, and 63.1$\times$ higher throughput on an RTX 3090 Ti for key generation, signing, and verification, respectively, compared to a single-threaded CPU implementation. Additionally, we study performance in real-world applications to demonstrate the effectiveness and applicability of our solution.