The deployment of inference services at the network edge, called edge inference, offloads computation-intensive inference tasks from mobile devices to edge servers, thereby enhancing the devices' capabilities and extending their battery life. In a multiuser system, the joint allocation of communication-and-computation ($\text{C}^\text{2}$) resources (i.e., task scheduling and bandwidth allocation) is made challenging by the adoption of two efficient inference techniques, batching and early exiting, and is further complicated by the heterogeneity of users' requirements on accuracy and latency. Batching groups multiple tasks into one batch for parallel processing, reducing time-consuming memory access and thereby boosting throughput (i.e., completed tasks per second). Early exiting, on the other hand, allows a task to exit a deep neural network without traversing the whole network, supporting a tradeoff between accuracy and latency. In this work, we study optimal $\text{C}^\text{2}$ resource allocation with batching and early exiting, which is an NP-complete integer programming problem. We design a set of efficient algorithms that tackle this challenge under the criterion of throughput maximization. Experimental results demonstrate that both the optimal and sub-optimal $\text{C}^\text{2}$ resource allocation algorithms can leverage integrated batching and early exiting to double the inference throughput compared with conventional schemes.
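To make the batching gain concrete, consider a simple illustrative latency model (an assumption for exposition, not the paper's system model): a batch of size $b$ incurs an affine delay $\tau(b) = \alpha + \beta b$, where $\alpha$ is the fixed per-batch overhead (e.g., memory access) and $\beta$ is the per-task compute time. The throughput is then
\[
\Theta(b) = \frac{b}{\tau(b)} = \frac{b}{\alpha + \beta b},
\]
which increases in $b$ and approaches $1/\beta$ as $b \to \infty$. In other words, batching amortizes the fixed cost $\alpha$ over more tasks, which is why grouping tasks boosts the completed-tasks-per-second rate.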
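The sketch below illustrates how batching and early exiting can be combined at inference time. It is a minimal, hedged example assuming a PyTorch-style model; the class `EarlyExitNet`, the confidence `threshold`, and the all-samples-confident exit rule are illustrative choices, not the architecture or policy proposed in this work.

```python
# Minimal sketch of batched inference with an early exit head (illustrative only).
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    def __init__(self, in_dim=784, hidden=256, classes=10, threshold=0.9):
        super().__init__()
        self.threshold = threshold
        self.block1 = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.exit1 = nn.Linear(hidden, classes)   # intermediate (early) exit head
        self.block2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.exit2 = nn.Linear(hidden, classes)   # final exit

    @torch.no_grad()
    def forward(self, x):
        h = self.block1(x)
        logits = self.exit1(h)
        # Softmax confidence of the early head for each task in the batch.
        conf = logits.softmax(dim=-1).max(dim=-1).values
        # Simplified per-batch rule: exit early only if every task is confident;
        # finer-grained per-task exiting is possible but omitted here.
        if bool((conf >= self.threshold).all()):
            return logits, 1
        h = self.block2(h)                        # otherwise traverse the rest
        return self.exit2(h), 2

model = EarlyExitNet()
batch = torch.randn(8, 784)                      # one batch of 8 offloaded tasks
logits, exit_taken = model(batch)
print(f"batch of {batch.shape[0]} tasks served at exit {exit_taken}")
```

The batch dimension (8 tasks processed in one pass) captures the batching gain, while the confidence test at `exit1` captures the accuracy-latency tradeoff of early exiting: lowering `threshold` shortens latency at some cost in accuracy.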