Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions, but they are compute- and memory-intensive. There is therefore a growing demand to make them answer heavy workloads of requests with the available computational resources. Unlike recent initiatives on inference servers and inference frameworks, which focus on the prediction of single DNNs, we propose a new software layer to serve ensembles of DNNs with flexibility and efficiency. Our inference system is designed around several technical innovations. First, we propose a novel procedure to find a good allocation matrix between devices (CPUs or GPUs) and DNN instances. It successively runs a worst-fit algorithm to allocate DNNs into device memory and a greedy algorithm to optimize the allocation settings and speed up the ensemble. Second, we design the inference system around multiple processes running asynchronously: batching, prediction, and the combination rule, with an efficient internal communication scheme to avoid overhead. Experiments show its flexibility and efficiency under extreme scenarios: it succeeds in serving an ensemble of 12 heavy DNNs on 4 GPUs and, conversely, a single DNN multi-threaded across 16 GPUs. It also outperforms the simple baseline of optimizing the batch size of the DNNs, with a speedup of up to 2.7X on an image classification task.
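To make the allocation procedure concrete, here is a minimal sketch of a worst-fit placement followed by a greedy replication step, assuming each DNN instance exposes a memory footprint and an estimated latency. All class names, fields, and the replication criterion are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the two-stage allocation procedure: a worst-fit pass
# places each DNN on the device with the most free memory, then a greedy pass
# spends leftover memory on replicas of the current bottleneck DNN.
# Memory/latency numbers and the bottleneck criterion are illustrative.

@dataclass
class DNN:
    name: str
    mem: float       # memory footprint of one instance (GB)
    latency: float   # estimated per-batch latency (ms)

@dataclass
class Device:
    name: str
    capacity: float                                # total memory (GB)
    instances: list = field(default_factory=list)

    @property
    def free(self) -> float:
        return self.capacity - sum(d.mem for d in self.instances)

def worst_fit(dnns, devices):
    """Place each DNN once, always on the device with the most free memory."""
    for dnn in sorted(dnns, key=lambda d: d.mem, reverse=True):
        target = max(devices, key=lambda dev: dev.free)
        if target.free < dnn.mem:
            raise RuntimeError(f"not enough memory for {dnn.name}")
        target.instances.append(dnn)

def greedy_speedup(dnns, devices):
    """Greedily replicate the bottleneck DNN while some device can host it."""
    while True:
        counts = {d.name: sum(inst.name == d.name
                              for dev in devices for inst in dev.instances)
                  for d in dnns}
        # effective latency of a DNN improves with its number of replicas
        bottleneck = max(dnns, key=lambda d: d.latency / counts[d.name])
        target = max(devices, key=lambda dev: dev.free)
        if target.free < bottleneck.mem:
            break
        target.instances.append(bottleneck)

dnns = [DNN("resnet152", 6.0, 40.0), DNN("densenet201", 4.0, 35.0)]
devices = [Device("gpu0", 16.0), Device("gpu1", 16.0)]
worst_fit(dnns, devices)
greedy_speedup(dnns, devices)
for dev in devices:
    print(dev.name, [d.name for d in dev.instances])
```

In this sketch, worst-fit balances memory pressure by always filling the emptiest device, while the greedy pass keeps replicating whichever DNN currently dominates the ensemble's latency until no device has room left.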
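The asynchronous process layout can likewise be sketched with standard Python multiprocessing, assuming one process per stage and queue-based message passing in place of the paper's internal communication scheme; the process names, the dummy forward pass, and the averaging combination rule are placeholders rather than the system's actual components.

```python
import multiprocessing as mp

# Minimal sketch (not the paper's implementation) of the asynchronous layout:
# a batching process groups incoming requests, one prediction process per DNN
# instance runs forward passes in parallel, and a combiner process applies the
# combination rule (here a plain average). Queues stand in for the internal
# communication scheme.

def batcher(request_q, worker_qs, batch_size=4):
    batch, batch_id = [], 0
    while True:
        item = request_q.get()
        if item is None:                       # shutdown signal
            break
        batch.append(item)
        if len(batch) == batch_size:
            for q in worker_qs:                # every DNN sees every batch
                q.put((batch_id, batch))
            batch, batch_id = [], batch_id + 1
    for q in worker_qs:
        q.put(None)

def predictor(in_q, result_q, model_id):
    while True:
        msg = in_q.get()
        if msg is None:
            result_q.put(None)
            break
        batch_id, batch = msg
        scores = [x * (model_id + 1) for x in batch]   # stand-in forward pass
        result_q.put((batch_id, scores))

def combiner(result_q, n_models):
    pending, done = {}, 0
    while done < n_models:
        msg = result_q.get()
        if msg is None:
            done += 1
            continue
        batch_id, scores = msg
        pending.setdefault(batch_id, []).append(scores)
        if len(pending[batch_id]) == n_models:         # all DNNs answered
            stacked = pending.pop(batch_id)
            combined = [sum(col) / n_models for col in zip(*stacked)]
            print(f"batch {batch_id}: {combined}")

if __name__ == "__main__":
    n_models = 3
    request_q = mp.Queue()
    worker_qs = [mp.Queue() for _ in range(n_models)]
    result_q = mp.Queue()
    procs = [mp.Process(target=batcher, args=(request_q, worker_qs))]
    procs += [mp.Process(target=predictor, args=(worker_qs[i], result_q, i))
              for i in range(n_models)]
    procs += [mp.Process(target=combiner, args=(result_q, n_models))]
    for p in procs:
        p.start()
    for x in range(8):
        request_q.put(float(x))
    request_q.put(None)
    for p in procs:
        p.join()
```

Because the three stages only exchange small messages through queues, batching, prediction, and combination overlap in time instead of blocking one another, which is the point of the asynchronous design summarized above.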