RPU: The Ring Processing Unit

Deepraj Soni, Negar Neda, Naifeng Zhang, Benedict Reynwar, Homer Gamil, Benjamin Heyman, Mohammed Nabeel, Ahmad Al Badawi, Yuriy Polyakov, Kellie Canida, Massoud Pedram, Michail Maniatakos, David Bruce Cousins, Franz Franchetti, Matthew French, Andrew Schmidt, Brandon Reagen

Ring-Learning-with-Errors (RLWE) has emerged as the foundation of many important techniques for improving security and privacy, including homomorphic encryption and post-quantum cryptography. While promising, these techniques have received limited use due to the extreme overhead of running them on general-purpose machines. In this paper, we present a novel vector Instruction Set Architecture (ISA) and microarchitecture for accelerating the ring-based computations of RLWE. The ISA, named B512, is developed to meet the needs of ring processing workloads while balancing high performance and general-purpose programming support. Having an ISA rather than fixed hardware facilitates continued software improvement post-fabrication and the ability to support evolving workloads. We then propose the ring processing unit (RPU), a high-performance, modular implementation of B512. The RPU has native large-word modular arithmetic support, capabilities for very wide parallel processing, and a large-capacity, high-bandwidth scratchpad to meet the needs of ring processing. We address the challenges of programming the RPU using a newly developed SPIRAL backend. A configurable simulator is built to characterize design tradeoffs and quantify performance. The best-performing design was implemented in RTL and used to validate simulator performance. In addition to our characterization, we show that an RPU using 20.5 mm² of GF 12nm can provide a speedup of 1485x over a CPU running a 64k, 128-bit NTT, a core RLWE workload.
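To make the "NTT, a core RLWE workload" mentioned in the abstract more concrete, here is a minimal Python sketch of a number-theoretic transform and the polynomial multiplication it enables. It is not taken from the paper and does not reflect B512 or the RPU implementation. The function names and the toy parameters (`q = 17`, `n = 8`, `omega = 2`) are assumptions chosen so the example stays readable; the paper's workload uses 64k points and 128-bit moduli. For simplicity it uses cyclic convolution, whereas RLWE schemes typically work modulo X^n + 1 (negacyclic).

```python
# Illustrative only: a recursive radix-2 NTT over Z_q (q prime).
# All names and parameters here are hypothetical, not from the paper.

def ntt(a, omega, q):
    """Forward NTT of list a; omega must be a primitive len(a)-th root of unity mod q."""
    n = len(a)
    if n == 1:
        return list(a)
    omega_sq = omega * omega % q
    even = ntt(a[0::2], omega_sq, q)   # transform of even-indexed coefficients
    odd = ntt(a[1::2], omega_sq, q)    # transform of odd-indexed coefficients
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q             # modular multiply (the "butterfly" twiddle)
        out[k] = (even[k] + t) % q
        out[k + n // 2] = (even[k] - t) % q
        w = w * omega % q
    return out

def intt(a, omega, q):
    """Inverse NTT: run the forward transform with omega^-1, then scale by n^-1."""
    n = len(a)
    inv_n = pow(n, -1, q)              # modular inverse (Python 3.8+)
    inv_omega = pow(omega, -1, q)
    return [x * inv_n % q for x in ntt(a, inv_omega, q)]

def cyclic_mul(a, b, omega, q):
    """Multiply polynomials modulo (X^n - 1, q) via pointwise products in the NTT domain."""
    fa, fb = ntt(a, omega, q), ntt(b, omega, q)
    return intt([x * y % q for x, y in zip(fa, fb)], omega, q)

def naive_cyclic_mul(a, b, q):
    """O(n^2) reference used to check the NTT-based result."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] = (out[(i + j) % n] + a[i] * b[j]) % q
    return out

# Toy parameters (assumed): 2 has multiplicative order 8 modulo 17.
Q, OMEGA = 17, 2
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [3, 0, 1, 0, 0, 2, 0, 0]
assert intt(ntt(a, OMEGA, Q), OMEGA, Q) == a
assert cyclic_mul(a, b, OMEGA, Q) == naive_cyclic_mul(a, b, Q)
```

Every butterfly performs one modular multiplication and two modular additions on independent element pairs. That is the pattern for which wide vector parallelism and native large-word modular arithmetic, as described in the abstract, are a natural fit.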
