Deep neural networks (DNNs) have been widely adopted for various mobile inference tasks, yet their ever-increasing computational demands are hindering their deployment on resource-constrained mobile devices. Hybrid deep learning partitions a DNN into two parts and deploys them across the mobile device and a server, aiming to reduce inference latency or prolong battery life of mobile devices. However, such partitioning produces (non-uniform) DNN fragments which are hard to serve efficiently on the server.This paper presents Graft -- an efficient inference serving system for hybrid deep learning with latency service-level objective (SLO) guarantees. Our main insight is to mitigate the non-uniformity by a core concept called DNN re-alignment, allowing multiple heterogeneous DNN fragments to be restructured to share layers. To fully exploit the potential of DNN re-alignment, Graft employs fine-grained GPU resource sharing. Based on that, we propose efficient algorithms for merging, grouping, and re-aligning DNN fragments to maximize request batching opportunities, minimizing resource consumption while guaranteeing the inference latency SLO. We implement a Graft prototype and perform extensive experiments with five types of widely used DNNs and real-world network traces. Our results show that Graft improves resource efficiency by up to 70% compared with the state-of-the-art inference serving systems.
翻译:暂无翻译