Leveraging ASIC AI Chips for Homomorphic Encryption

Jianming Tong, Tianhao Huang, Jingtian Dang, Leo de Castro, Anirudh Itagi, Anupam Golder, Asra Ali, Jeremy Kun, Jevin Jiang, Arvind, G. Edward Suh, Tushar Krishna

From arXiv; IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2026; 18 pages, 16 figures, 5 algorithms, 10 tables. Leveraging Google TPUs for Homomorphic Encryption

Homomorphic Encryption (HE) provides strong data privacy for cloud services, but at the cost of prohibitive computational overhead. While GPUs have emerged as a practical platform for accelerating HE, an order-of-magnitude energy-efficiency gap remains relative to specialized (but expensive) HE ASICs. This paper explores an alternative direction: leveraging existing AI accelerators, such as Google's TPUs with their coarse-grained compute and memory architectures, to offer a path toward ASIC-level energy efficiency for HE. However, this architectural paradigm creates a fundamental mismatch with state-of-the-art (SotA) HE algorithms designed for GPUs. These algorithms rely heavily on (1) high-precision (32-bit) integer arithmetic, which on a TPU must run on the low-throughput vector unit, leaving its high-throughput, low-precision (8-bit) matrix engine (MXU) idle, and (2) fine-grained data permutations, which are inefficient on the TPU's coarse-grained memory subsystem. Consequently, porting GPU-optimized HE libraries to TPUs results in severe resource under-utilization and performance degradation. To tackle these challenges, we introduce CROSS, a compiler framework that systematically transforms HE workloads to align with the TPU's architecture. CROSS makes two key contributions: (1) Basis-Aligned Transformation (BAT), a novel technique that converts high-precision modular arithmetic into dense, low-precision (INT8) matrix multiplications, unlocking the TPU's MXU for HE and improving its utilization, and (2) Memory-Aligned Transformation (MAT), which eliminates costly runtime data reordering by embedding the reordering into compute kernels through offline parameter transformation. CROSS (TPU v6e) achieves higher throughput per watt on NTT and HE operators than WarpDrive, FIDESlib, FAB, HEAP, and Cheddar, establishing AI ASICs as the SotA efficient platform for HE operators. Code: https://github.com/EfficientPPML/CROSS
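
The two ideas named in the abstract can be illustrated with a small sketch. This is not the CROSS implementation and not the authors' BAT or MAT algorithms. It is a plain limb-decomposition example, written under the assumption that the workload is a modular matrix-vector product y = W x mod q with a fixed parameter matrix W. The function names (to_limbs, matvec_limbs, fold_permutation), the modulus, and the size n = 8 are hypothetical choices made only for illustration. The first part shows how a 32-bit product can be assembled from 8-bit by 8-bit multiplies with a wider accumulator; the second shows how an output permutation can be absorbed into W offline, so no reordering happens at run time.

```python
import random

LIMB_BITS = 8
NUM_LIMBS = 4  # four 8-bit limbs cover any operand below 2**32


def to_limbs(v):
    """Split a non-negative integer below 2**32 into 8-bit limbs, least significant first."""
    return [(v >> (LIMB_BITS * k)) & 0xFF for k in range(NUM_LIMBS)]


def matvec_reference(W, x, q):
    """Direct high-precision matrix-vector product modulo q."""
    return [sum(w * v for w, v in zip(row, x)) % q for row in W]


def matvec_limbs(W, x, q):
    """Same product, built only from 8-bit x 8-bit multiplies."""
    # Offline step (assumption: W is a fixed parameter, so its limbs are precomputed).
    W_limbs = [[[to_limbs(w)[i] for w in row] for row in W] for i in range(NUM_LIMBS)]
    # Runtime step: split the input vector into limbs.
    x_limbs = [[to_limbs(v)[j] for v in x] for j in range(NUM_LIMBS)]
    y = [0] * len(W)
    for i in range(NUM_LIMBS):
        for j in range(NUM_LIMBS):
            for r, row in enumerate(W_limbs[i]):
                # Low-precision dot product; every factor is below 256.
                acc = sum(a * b for a, b in zip(row, x_limbs[j]))
                assert acc < 2**31  # fits a 32-bit accumulator for this vector length
                # Recombine with the limb weight 2**(8*(i+j)). Python integers are
                # unbounded; real hardware would need a reduction strategy here.
                y[r] += acc << (LIMB_BITS * (i + j))
    return [v % q for v in y]


def fold_permutation(W, perm):
    """Offline: reorder the rows of W so the output comes out already permuted."""
    return [W[p] for p in perm]


random.seed(0)
q = 268369921  # example modulus below 2**32, chosen only for illustration
n = 8
W = [[random.randrange(q) for _ in range(n)] for _ in range(n)]
x = [random.randrange(q) for _ in range(n)]
perm = [0, 4, 2, 6, 1, 5, 3, 7]  # bit-reversal order for n = 8

reference = matvec_reference(W, x, q)
limb_result = matvec_limbs(W, x, q)
permuted_result = matvec_limbs(fold_permutation(W, perm), x, q)

print(limb_result == reference)
print(permuted_result == [reference[p] for p in perm])
```

In this sketch the 16 limb-pair products are independent small-integer matrix-vector products, which is the shape of work a low-precision matrix engine is built for. The permutation costs nothing at run time because it only changes which precomputed row of W feeds each output position.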
