Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, thereby challenging efficient and scalable deployment. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs), which limits scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, improving GEMM latency by 5.6-24.5x and GEMV throughput by 1.1-86.2x with only 3.2% power and 1.4% area overheads in the SIMD units. T-SAR achieves 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach for efficient LLM inference on edge platforms.
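To make the memory-based LUT baseline concrete (the existing CPU approach that T-SAR replaces with in-register table generation), the following is a minimal, hypothetical C sketch of LUT-based ternary GEMV: activations are grouped, all possible ternary partial sums for each group are precomputed into an in-memory table, and packed weight codes index that table. The group size, base-3 packing, and function names are illustrative assumptions, not the paper's interface.

```c
/* Hypothetical sketch of memory-based LUT ternary GEMV (the CPU baseline
 * that T-SAR improves on by generating tables inside SIMD registers).
 * Group size, encoding, and names are illustrative, not from the paper. */
#include <stdint.h>
#include <stdlib.h>

#define G 4          /* ternary weights per group               */
#define LUT_SIZE 81  /* 3^G possible ternary patterns per group */

/* Build a LUT for one group of G activations: entry p holds the dot product
 * of the activations with the ternary pattern encoded by p in base 3
 * (digit 0 -> weight -1, digit 1 -> 0, digit 2 -> +1). */
static void build_group_lut(const float act[G], float lut[LUT_SIZE]) {
    for (int p = 0; p < LUT_SIZE; ++p) {
        float sum = 0.0f;
        int code = p;
        for (int i = 0; i < G; ++i) {
            int digit = code % 3;            /* ternary digit for weight i */
            code /= 3;
            sum += (float)(digit - 1) * act[i];
        }
        lut[p] = sum;
    }
}

/* y = W x, with each row of W stored as base-3 group codes (one byte per group). */
void ternary_gemv_lut(const uint8_t *w_codes, const float *x, float *y,
                      int rows, int cols) {
    int groups = cols / G;
    float *luts = malloc((size_t)groups * LUT_SIZE * sizeof(float));
    if (!luts) return;

    /* Precompute one table per activation group (the memory-resident LUTs). */
    for (int g = 0; g < groups; ++g)
        build_group_lut(&x[g * G], &luts[g * LUT_SIZE]);

    /* Each output element reduces to one table lookup per weight group. */
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int g = 0; g < groups; ++g)
            acc += luts[g * LUT_SIZE + w_codes[r * groups + g]];
        y[r] = acc;
    }
    free(luts);
}
```

In this baseline the per-group tables live in memory, so every lookup competes for cache and bandwidth as model size grows; the abstract's claim is that holding and generating such tables dynamically in the SIMD register file removes that bottleneck while keeping the lookup-based formulation of ternary GEMM/GEMV.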