As machine learning techniques are applied to a widening range of applications, high-throughput machine learning (ML) inference servers have become critical for online service applications. Such ML inference servers pose two challenges: first, they must provide bounded latency for each request to support a consistent service-level objective (SLO), and second, they may serve multiple heterogeneous ML models in a single system, since certain tasks involve invocation of multiple models and consolidating multiple models can improve system utilization. To address these two requirements of ML inference servers, this paper proposes a new ML inference scheduling framework for multi-model ML inference servers. The paper first shows that under SLO constraints, current GPUs are not fully utilized for ML inference tasks. To maximize the resource efficiency of inference servers, a key mechanism proposed in this paper is to exploit hardware support for spatial partitioning of GPU resources. With the partitioning mechanism, a new abstraction layer of GPU resources is created with configurable resource sizes. The scheduler assigns requests to virtual GPUs, called gpu-lets, with the most effective amount of resources. The paper also investigates a remedy for potential interference effects when two ML tasks run concurrently on a GPU. Our prototype implementation demonstrates that spatial partitioning enhances throughput by 102.6% on average while satisfying SLOs.
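To make the gpu-let idea concrete, the sketch below illustrates one plausible reading of the scheduling policy: pick the smallest GPU partition whose estimated latency for a model still meets its SLO, applying a penalty when the partition shares a GPU with another task. This is a minimal illustrative sketch, not the paper's implementation; the names (`GpuLet`, `assign_gpulet`), the partition sizes (e.g., fractions that could map to NVIDIA MPS active-thread percentages), the profiled latency numbers, and the uniform interference factor are all assumptions introduced here for exposition.

```python
from dataclasses import dataclass, field
from typing import Optional

# Candidate partition sizes as fractions of one GPU's compute resources
# (assumed; could correspond to MPS active-thread percentages).
PARTITION_SIZES = [0.2, 0.4, 0.5, 0.6, 0.8, 1.0]

# Assumed offline-profiled latency (ms) per model per partition size;
# smaller partitions run slower. Values are illustrative only.
PROFILED_LATENCY = {
    "resnet50":  {0.2: 38.0, 0.4: 21.0, 0.5: 18.0, 0.6: 16.0, 0.8: 13.0, 1.0: 12.0},
    "bert-base": {0.2: 95.0, 0.4: 52.0, 0.5: 44.0, 0.6: 39.0, 0.8: 33.0, 1.0: 30.0},
}

# Assumed uniform slowdown when a gpu-let shares its GPU, standing in
# for the interference remedy the paper investigates.
INTERFERENCE_FACTOR = 1.15

@dataclass
class GpuLet:
    gpu_id: int
    fraction: float                 # share of the GPU's compute resources
    model: Optional[str] = None     # model assigned to this gpu-let

@dataclass
class Gpu:
    gpu_id: int
    free_fraction: float = 1.0
    lets: list = field(default_factory=list)

def estimated_latency(model: str, fraction: float, shared: bool) -> float:
    """Profiled latency, inflated by a penalty if the GPU is shared."""
    base = PROFILED_LATENCY[model][fraction]
    return base * INTERFERENCE_FACTOR if shared else base

def assign_gpulet(gpus: list, model: str, slo_ms: float) -> Optional[GpuLet]:
    """Give `model` the smallest gpu-let whose estimated latency meets the SLO."""
    for frac in PARTITION_SIZES:            # smallest first -> highest utilization
        for gpu in gpus:
            if gpu.free_fraction + 1e-9 < frac:
                continue                     # not enough room on this GPU
            # The partition is "shared" if other gpu-lets exist or could be
            # placed in the remaining space on the same GPU.
            shared = frac < 1.0 and (bool(gpu.lets) or gpu.free_fraction - frac > 0)
            if estimated_latency(model, frac, shared) <= slo_ms:
                let = GpuLet(gpu.gpu_id, frac, model)
                gpu.lets.append(let)
                gpu.free_fraction -= frac
                return let
    return None  # no feasible partition; request must queue or be rejected

if __name__ == "__main__":
    gpus = [Gpu(0), Gpu(1)]
    for model, slo in [("resnet50", 25.0), ("bert-base", 60.0), ("resnet50", 25.0)]:
        print(model, "->", assign_gpulet(gpus, model, slo))
```

Under these assumptions, preferring the smallest SLO-feasible partition is what frees the remaining GPU fraction for co-located models, which is the utilization gain the abstract attributes to spatial partitioning.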