With the rapid development of deep neural networks (DNNs), many real-world applications adopt multiple models to perform compound tasks, such as co-running classification, detection, and segmentation models on autonomous vehicles. Such multi-tenant DNN inference greatly increases computational complexity and calls for comprehensive collaboration across graph-level operator scheduling, runtime-level resource awareness, and hardware scheduler support. However, existing scheduling support for such multi-tenant inference remains limited. In this work, we propose a resource-aware scheduling framework for efficient multi-tenant DNN inference on GPUs, which automatically coordinates DNN computation across different execution levels. Leveraging a unified scheduling intermediate representation and an automated ML-based search algorithm, optimal schedules can be generated that judiciously adjust model concurrency and interleave DNN operators, maintaining balanced resource utilization throughout the inference process and ultimately improving runtime efficiency. Experiments show that we consistently achieve a 1.3-1.7x speedup over standard DNN runtime libraries (e.g., cuDNN, TVM) and dedicated concurrent scheduling methods (e.g., NVIDIA Multi-Stream).
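To make the notion of multi-tenant concurrency concrete, the following is a minimal PyTorch sketch (an illustration only, not the paper's framework or IR): it runs two tenant models on separate CUDA streams so their operators can interleave on one GPU, which is the kind of baseline concurrency (akin to NVIDIA Multi-Stream) that the proposed resource-aware scheduler refines. The model choices and batch sizes are hypothetical.

```python
import torch
import torchvision.models as models

# Two hypothetical tenant models sharing one GPU.
device = torch.device("cuda")
classifier = models.resnet50().eval().to(device)
detector = models.mobilenet_v2().eval().to(device)

x1 = torch.randn(8, 3, 224, 224, device=device)
x2 = torch.randn(8, 3, 224, 224, device=device)

# Separate CUDA streams allow the two models' kernels to overlap on the GPU,
# analogous to the concurrent scheduling baselines discussed in the abstract.
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

with torch.no_grad():
    with torch.cuda.stream(s1):
        y1 = classifier(x1)
    with torch.cuda.stream(s2):
        y2 = detector(x2)

# Wait for both streams before consuming the results.
torch.cuda.synchronize()
print(y1.shape, y2.shape)
```

Plain multi-streaming like this leaves concurrency decisions to the hardware scheduler; the framework described above instead searches for operator-level interleavings that keep resource utilization balanced end to end.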