The rapid advancements in artificial intelligence (AI), particularly the Large Language Models (LLMs), have profoundly affected our daily work and communication forms. However, the colossal scale of LLM presents significant operational challenges, particularly when attempting to deploy them on resource-constrained edge devices such as smartphones, robots, and embedded systems. In this work, we proposed EdgeLLM, an efficient CPU-FPGA heterogeneous acceleration framework, to markedly enhance the computational efficiency of LLMs on edge. We first analyzed the whole operators within AI models and developed a universal data parallelism scheme, which is generic and can be adapted to any type of AI algorithm. Then, we developed fully-customized hardware operators according to the designated data formats. A multitude of optimization techniques have been integrated in the design, such as approximate FP16*INT4 and FP16*FP16 computation engines, group vector systolic arrays, log-scale structured sparsity, asynchronous between data transfer and processing. Finally, we proposed an end-to-end compilation scheme that can dynamically compile all of the operators and map the whole model on CPU-FPGA heterogeneous system. The design has been deployed on AMD Xilinx VCU128 FPGA, our accelerator achieves 1.67x higher throughput and 7.4x higher energy efficiency than the commercial GPU (NVIDIA A100-SXM4-80G) on ChatGLM2-6B, and shows 10%~20% better performance than state-of-the-art FPGA accelerator of FlightLLM in terms of HBM bandwidth utilization and LLM throughput.
翻译:暂无翻译