TZ-LLM: Protecting On-Device Large Language Models with Arm TrustZone

Large Language Models (LLMs) deployed on mobile devices offer benefits such as user privacy and reduced network latency, but they introduce a significant security risk: the leakage of proprietary models to end users. To mitigate this risk, we propose a system design for protecting on-device LLMs using Arm's Trusted Execution Environment (TEE), TrustZone. Our system addresses two primary challenges: (1) the dilemma between memory efficiency and fast inference, where fast inference calls for caching model parameters within TEE memory; and (2) the lack of efficient and secure Neural Processing Unit (NPU) time-sharing between the Rich Execution Environment (REE) and the TEE. Our approach incorporates two key innovations. First, we employ pipelined restoration, which leverages the deterministic memory access patterns of LLM inference to prefetch parameters on demand, hiding memory allocation, I/O, and decryption latency under computation time. Second, we introduce a co-driver design that creates a minimal data-plane NPU driver in the TEE, which collaborates with the full-fledged REE driver. This reduces the size of the TEE's trusted computing base (TCB) and eliminates control-plane reinitialization overhead during NPU world switches. We implemented our system on the emerging OpenHarmony OS and the llama.cpp inference framework, and evaluated it with various LLMs on an Arm Rockchip device. Compared to a strawman TEE baseline lacking our optimizations, our system reduces time to first token (TTFT) by up to 90.9% and increases decoding speed by up to 23.2%.

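
The pipelined restoration idea in the abstract can be illustrated with a toy sketch. This is not the TZ-LLM implementation: it only assumes that layers are consumed in a fixed, known order, so the parameters for layer i+1 can be loaded and decrypted in the background while layer i is being computed. All names here (`xor_decrypt`, `load_layer`, `compute_layer`, `pipelined_inference`) are hypothetical, the XOR "cipher" is a stand-in for real decryption, and a list of byte strings stands in for encrypted storage.

```python
from concurrent.futures import ThreadPoolExecutor


def xor_decrypt(blob: bytes, key: int) -> bytes:
    """Toy stand-in for real decryption (XOR is its own inverse)."""
    return bytes(b ^ key for b in blob)


def load_layer(storage: list, idx: int, key: int) -> bytes:
    """Restore one layer: fetch its encrypted blob ("I/O"), then decrypt it."""
    return xor_decrypt(storage[idx], key)


def compute_layer(state: int, params: bytes) -> int:
    """Toy stand-in for a layer's computation: fold the parameters into the state."""
    return state + sum(params)


def sequential_inference(storage: list, key: int, state: int = 0) -> int:
    """Baseline: restore each layer, then compute it, with no overlap."""
    for i in range(len(storage)):
        state = compute_layer(state, load_layer(storage, i, key))
    return state


def pipelined_inference(storage: list, key: int, state: int = 0) -> int:
    """Overlap restoration of layer i+1 with computation of layer i."""
    if not storage:
        return state
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start restoring the first layer before any computation begins.
        pending = pool.submit(load_layer, storage, 0, key)
        for i in range(len(storage)):
            params = pending.result()  # wait until layer i is restored
            if i + 1 < len(storage):
                # The access order is deterministic, so the next layer is known.
                pending = pool.submit(load_layer, storage, i + 1, key)
            state = compute_layer(state, params)
            del params  # drop this layer's plaintext once it has been used
    return state
```

Both functions produce the same result; the pipelined version only changes when restoration happens, so that its latency can be hidden behind computation. Dropping each layer's plaintext after use is one way to picture the memory-efficiency side of the trade-off, as opposed to keeping all parameters cached.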