Edge inference has become increasingly widespread, with applications ranging from retail to wearable technology. Clusters of networked, resource-constrained edge devices are becoming common, yet no existing system partitions a DNN across such a cluster while maximizing the inference throughput of the system. We present an algorithm that partitions DNNs and distributes them across a set of edge devices with the goal of minimizing the bottleneck latency and thereby maximizing inference throughput. The algorithm scales well across different node memory capacities and numbers of nodes. We find that it reduces the bottleneck latency by 10x compared to a random algorithm and by 35% compared to a greedy joint partitioning-placement algorithm. Furthermore, we find empirically that for the set of representative models we tested, the algorithm produces results within 9.2% of the optimal bottleneck latency.
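To make the objective concrete, the following minimal sketch (ours, not from the paper) illustrates why minimizing the bottleneck latency of a pipelined partition maximizes steady-state inference throughput; the stage latencies and partitions are hypothetical placeholders.

```python
# A minimal sketch (not the paper's algorithm) illustrating the relationship
# between bottleneck latency and pipeline inference throughput.

def pipeline_throughput(stage_latencies_ms):
    """In a pipeline of devices, steady-state throughput is limited by the
    slowest stage: throughput = 1 / max(stage latency)."""
    bottleneck_ms = max(stage_latencies_ms)
    return 1000.0 / bottleneck_ms  # inferences per second

# Two hypothetical ways to split the same DNN across three edge devices.
balanced_partition = [12.0, 11.5, 12.5]   # ms per stage
unbalanced_partition = [4.0, 28.0, 4.0]   # same total work, poor split

print(pipeline_throughput(balanced_partition))    # ~80.0 inf/s (bottleneck 12.5 ms)
print(pipeline_throughput(unbalanced_partition))  # ~35.7 inf/s (bottleneck 28 ms)
```

Under this view, reducing the latency of the slowest partition directly raises the throughput of the whole cluster, which is why the partitioning-placement objective targets the bottleneck rather than total latency.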