Future improvements in large language model (LLM) services increasingly hinge on access to high-value professional knowledge rather than generic web data. However, the providers of this knowledge face a skewed tradeoff between income and risk: they receive little of the downstream value yet retain copyright and privacy liability, making them reluctant to contribute their assets to LLM services. Existing techniques do not offer a trustworthy and controllable way to use professional knowledge, because they keep providers in the dark and fuse knowledge parameters into the underlying LLM backbone. In this paper, we present PKUS, the Professional Knowledge Utilization System, which treats professional knowledge as a first-class, separable artifact. PKUS keeps the backbone model on GPUs and encodes each provider's contribution as a compact adapter that executes only inside an attested Trusted Execution Environment (TEE). A hardware-rooted lifecycle protocol, adapter pruning, multi-provider aggregation, and split-execution scheduling together make this design practical at serving time. On SST-2, MNLI, and SQuAD with GPT-2 Large and Llama-3.2-1B, PKUS preserves model utility, matching the accuracy and F1 of full fine-tuning and plain LoRA, while achieving the lowest per-request latency, with an 8.1-11.9x speedup over CPU-only TEE inference and naive CPU-GPU co-execution.
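For concreteness, the split-execution design can be summarized with the standard LoRA forward pass; the symbols below follow the usual LoRA notation rather than names defined by the system, and the assignment of terms to devices is a sketch of the description above:
\[
h \;=\; \underbrace{W_0 x}_{\text{GPU (backbone)}} \;+\; \underbrace{\tfrac{\alpha}{r}\, B A x}_{\text{TEE (provider adapter)}},
\qquad W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k),
\]
so the dense backbone product stays on the GPU, while the compact low-rank path encoding the provider's knowledge executes only inside the attested enclave and the adapter weights $A, B$ never leave TEE memory.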