Edge Deployment of Small Language Models: A Comprehensive Comparison of CPU, GPU and NPU Backends

Edge computing processes data where it is generated, enabling faster decisions, lower bandwidth usage, and improved privacy. However, edge devices typically operate under strict constraints on processing power, memory, and energy consumption, which makes them unsuitable for large language models (LLMs). Small Language Models (SLMs) offer a lightweight alternative: they bring AI inference to resource-constrained environments by significantly reducing computational cost while remaining suitable for specialization and customization. Even so, selecting the hardware platform that best balances performance and efficiency for SLM inference remains challenging under these resource limits. To address this issue, this study evaluates the inference performance and energy efficiency of commercial CPUs (Intel and ARM), GPUs (NVIDIA), and NPUs (RaiderChip) for running SLMs. GPUs, the usual platform of choice, are compared against commercial NPUs and recent multi-core CPUs. While NPUs leverage custom hardware designs optimized for computation, modern CPUs increasingly incorporate dedicated features targeting language-model workloads. Using a common execution framework and a suite of state-of-the-art SLMs, we analyze maximum achievable performance as well as processing efficiency and energy efficiency across the commercial solutions available for each platform. The results indicate that specialized backends outperform general-purpose CPUs, with NPUs achieving the highest performance by a wide margin. Bandwidth normalization proves essential for fair cross-architecture comparisons. Although low-power ARM processors deliver competitive results when energy usage is considered, metrics that combine performance and power, such as the energy-delay product (EDP), again highlight NPUs as the dominant architecture. These findings show that designs optimized for both efficiency and performance offer a clear advantage for edge workloads.
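
As a rough illustration of the kinds of metrics mentioned above, the following Python sketch computes raw throughput, a bandwidth-normalized throughput, energy per token, and a per-token energy-delay product for a single inference run. The exact definitions used in the study are not given here, so everything in it is an assumption: the `Run` record, its field names, and the choice to normalize by peak memory bandwidth and to compute EDP per generated token are hypothetical and meant only to show how such metrics might be derived from measured quantities.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One measured inference run (hypothetical record, not from the study)."""
    name: str
    tokens: int          # number of tokens generated
    seconds: float       # wall-clock generation time in seconds
    avg_power_w: float   # average power draw in watts
    mem_bw_gbs: float    # peak memory bandwidth of the device in GB/s


def tokens_per_second(run: Run) -> float:
    """Raw throughput in tokens per second."""
    return run.tokens / run.seconds


def bandwidth_normalized_throughput(run: Run) -> float:
    """Throughput per unit of memory bandwidth: (tokens/s) per (GB/s).

    Assumed normalization: dividing by peak bandwidth so that devices with
    very different memory systems can be compared on a common scale.
    """
    return tokens_per_second(run) / run.mem_bw_gbs


def energy_per_token(run: Run) -> float:
    """Energy in joules spent per generated token (power x time / tokens)."""
    return run.avg_power_w * run.seconds / run.tokens


def edp_per_token(run: Run) -> float:
    """Energy-delay product per token in joule-seconds (lower is better)."""
    delay_per_token = run.seconds / run.tokens
    return energy_per_token(run) * delay_per_token
```

Because EDP multiplies energy by delay, a device that is frugal but slow can still score worse than one that draws more power and finishes much sooner, which is consistent with the contrast the abstract draws between energy-only results and combined metrics.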
