By provisioning inference offloading services, edge inference drives the rapid growth of AI applications at the network edge. However, reducing inference latency remains a significant challenge. To address this issue, we develop a parameter-sharing AI model loading (PartialLoading) framework for multi-user edge inference, which exploits two key insights: 1) the majority of latency arises from loading AI models into server GPU memory, and 2) different AI models can share a significant number of parameters, whose redundant loading should be avoided. To this end, we formulate a joint multi-user scheduling and spectrum bandwidth allocation problem that maximizes task throughput by exploiting shared parameter blocks across models. The intuition is to judiciously schedule user requests so that consecutively loaded models reuse shared parameter blocks, thereby substantially reducing model loading time. To make the problem tractable, we decouple it into two sub-problems, user scheduling and bandwidth allocation, and show that solving them sequentially yields the solution to the original problem. Since the problem is NP-hard, we first study an important special case, termed the "backbone-sharing" case, and design a dynamic programming-based algorithm that obtains the optimal solution in polynomial time. For the general case, we propose a greedy heuristic that obtains a sub-optimal solution efficiently. Simulation results demonstrate that, compared with user scheduling that does not exploit parameter sharing, the proposed framework significantly improves task throughput under deadline constraints.
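The abstract gives no pseudocode for the dynamic program; the following is a minimal, purely illustrative sketch of how backbone sharing can make optimal scheduling a polynomial-time, knapsack-style DP. It rests on simplifications not stated in the abstract: per-user deadlines and bandwidth allocation are collapsed into a single time horizon, each model family shares one backbone whose loading time is paid once when the family's requests are scheduled back-to-back, and every request additionally pays its own head-loading and compute time. All names (`ModelFamily`, `Request`, `max_served`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Request:
    head_load: int  # time to load the model's task-specific head
    compute: int    # inference compute time

@dataclass
class ModelFamily:
    backbone_load: int  # time to load the shared backbone, paid once per family
    requests: list      # requests whose models are built on this backbone

def max_served(families, horizon):
    """dp[t] = max number of requests served within time budget t when each
    chosen family's requests are scheduled consecutively, so its shared
    backbone is loaded exactly once."""
    dp = [0] * (horizon + 1)
    for fam in families:
        # Serving k requests of a family is cheapest using the k smallest costs.
        costs = sorted(r.head_load + r.compute for r in fam.requests)
        prefix = [0]
        for c in costs:
            prefix.append(prefix[-1] + c)
        new_dp = dp[:]  # option: serve nothing from this family
        for t in range(horizon + 1):
            for k in range(1, len(costs) + 1):
                cost = fam.backbone_load + prefix[k]
                if cost > t:
                    break
                new_dp[t] = max(new_dp[t], dp[t - cost] + k)
        dp = new_dp
    return dp[horizon]

# Toy instance: with a 12-unit horizon it pays to batch both requests of the
# first family behind a single backbone load.
fams = [ModelFamily(5, [Request(1, 2), Request(2, 2)]),
        ModelFamily(4, [Request(1, 1)])]
print(max_served(fams, 12))  # -> 2
```

Under these assumptions the DP runs in O(|families| x horizon x max requests per family) time, mirroring the polynomial-time optimality claimed for the backbone-sharing case.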
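For the general case, the abstract states only that a greedy heuristic is used; one plausible, purely illustrative reading is a nearest-neighbor-style greedy over parameter-block overlap: always load next the model whose blocks missing from GPU memory are cheapest to fetch. The block IDs, the one-model-resident memory assumption, and the function `greedy_order` below are assumptions for the sketch, not the paper's algorithm.

```python
def greedy_order(models, block_load_time):
    """models: model_id -> set of parameter-block IDs; block_load_time:
    block ID -> loading time. Repeatedly pick the unscheduled model whose
    blocks missing from GPU memory take the least time to load."""
    def load_cost(blocks, resident):
        return sum(block_load_time[b] for b in blocks - resident)

    remaining, order, resident = set(models), [], set()
    while remaining:
        # sorted() makes tie-breaking deterministic across runs.
        nxt = min(sorted(remaining), key=lambda m: load_cost(models[m], resident))
        order.append(nxt)
        resident = models[nxt]  # simplification: memory holds one model at a time
        remaining.remove(nxt)
    return order

# Toy example: models A and B share a backbone block "bb"; C is unrelated.
models = {"A": {"bb", "headA"}, "B": {"bb", "headB"}, "C": {"cnn", "headC"}}
times = {"bb": 8, "headA": 1, "headB": 1, "cnn": 6, "headC": 1}
print(greedy_order(models, times))  # -> ['C', 'A', 'B']; B reuses resident "bb"
```

This greedy keeps models with large overlaps adjacent in the loading order; per-request deadlines and the bandwidth-allocation sub-problem, which the paper handles jointly with scheduling, are omitted here for brevity.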