Fully homomorphic encryption (FHE) enables secure computation on encrypted data, mitigating privacy concerns in cloud and edge environments. However, FHE has high compute and memory demands, which has motivated extensive acceleration research across diverse hardware platforms, especially GPUs. In this paper, we perform a microarchitectural analysis of CKKS, a popular FHE scheme, on modern GPUs. We focus on on-chip cache behavior and show that the dominant kernels remain bound by memory bandwidth despite a high-bandwidth L2 cache, exposing a persistent memory wall. We further find that overall CKKS pipeline throughput is constrained by low per-kernel hardware utilization, caused by insufficient intra-kernel parallelism. Motivated by these findings, we introduce Theodosian, a set of complementary, memory-aware optimizations that improve cache efficiency and reduce runtime overheads. Our approach delivers consistent speedups across various CKKS workloads. On an RTX 5090, Theodosian reduces the bootstrapping latency for 32,768 complex numbers to 15.2 ms, and to 12.8 ms with additional algorithmic optimizations, which is, to the best of our knowledge, a new state of the art for GPU performance.