Serverless computing has emerged as a pivotal paradigm for deploying Deep Learning (DL) models, offering automatic scaling and cost efficiency. However, the inherent cold start problem in serverless ML inference systems, particularly the time-consuming model loading process, remains a significant bottleneck. Utilizing pipelined model loading improves efficiency but still suffer from pipeline stalls due to sequential layer construction and monolithic weight loading. In this paper, we propose \textit{Cicada}, a novel pipeline optimization framework that coordinates computational, storage, and scheduling resources through three key mechanisms: (1) \textit{MiniLoader}: which reduces layer construction overhead by opportunistically optimizing parameter initialization; (2) \textit{WeightDecoupler}: decoupling weight file processing from layer construction, enabling asynchronous weight retrieval and out-of-order weight application; (3) \textit{Priority-Aware Scheduler}: dynamically allocating resources to ensure high-priority inference tasks are executed promptly. Our experimental results demonstrate that Cicada achieves significant performance improvements over the state-of-the-art PISeL framework. Specifically, Cicada reduces end-to-end inference latency by an average of 61.59\%, with the MiniLoader component contributing the majority of this optimization (53.41\%), and the WeightDecoupler achieves up to 26.17\% improvement. Additionally, Cicada achieves up to 2.52x speedup in the inference pipeline utlization compared to PISeL.
翻译:暂无翻译