Emerging edge applications such as Internet of Things (IoT) analytics and augmented reality have tight latency constraints, and hardware AI accelerators have recently been proposed to speed up the deep neural network (DNN) inference run by these applications. Resource-constrained edge servers and accelerators tend to be multiplexed across multiple IoT applications, introducing the potential for performance interference between latency-sensitive workloads. In this paper, we design analytic models that capture the performance of DNN inference workloads on shared edge accelerators, such as GPUs and edge TPUs, under different multiplexing and concurrency behaviors. After validating our models using extensive experiments, we use them to design cluster resource management algorithms that intelligently place multiple applications on edge accelerators while respecting their latency constraints. We implement a prototype of our system in Kubernetes and show that it can host 2.3X more DNN applications in heterogeneous multi-tenant edge clusters with no latency violations when compared to traditional knapsack hosting algorithms.
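For context, the sketch below illustrates the kind of capacity-only, knapsack-style baseline the abstract compares against: applications are packed onto accelerators by a single demand/capacity metric, without modeling interference between co-located workloads. This is a minimal, hedged sketch, not the paper's algorithm; all names (`App`, `Accelerator`, `demand`, `capacity`) are illustrative assumptions.

```python
# Hedged sketch of a capacity-based "knapsack hosting" baseline (not the paper's method).
# Such a baseline ignores performance interference between co-located DNN workloads,
# which is why it can admit placements that later violate latency SLOs.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class App:
    name: str
    demand: float          # assumed fraction of accelerator capacity the app needs
    latency_slo_ms: float  # the app's latency constraint (not used by this baseline)


@dataclass
class Accelerator:
    name: str
    capacity: float        # total schedulable capacity (1.0 = fully free)
    apps: List[App]

    def free(self) -> float:
        return self.capacity - sum(a.demand for a in self.apps)


def knapsack_place(app: App, cluster: List[Accelerator]) -> Optional[Accelerator]:
    """Greedy best-fit: pick the accelerator with the least remaining capacity
    that still fits the app's demand; return None if no accelerator fits."""
    candidates = [acc for acc in cluster if acc.free() >= app.demand]
    if not candidates:
        return None
    target = min(candidates, key=lambda acc: acc.free())
    target.apps.append(app)
    return target


if __name__ == "__main__":
    cluster = [Accelerator("edge-gpu-0", 1.0, []), Accelerator("edge-tpu-0", 1.0, [])]
    for app in [App("detector", 0.6, 50.0), App("classifier", 0.5, 30.0)]:
        placed = knapsack_place(app, cluster)
        print(app.name, "->", placed.name if placed else "rejected")
```

Because placement here depends only on aggregate capacity, two latency-sensitive apps can be packed onto the same accelerator even when their concurrent execution would inflate inference latency; the paper's interference-aware models are meant to close exactly that gap.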