GPUs are essential for accelerating latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads is becoming increasingly compelling. However, GPU sharing inevitably causes severe performance interference among co-located inference workloads, as shown by an empirical measurement study of DNN inference on EC2 GPU instances. Existing work on guaranteeing inference performance service level objectives (SLOs) focuses on either temporal sharing of GPUs or reactive GPU resource scaling and inference migration techniques; how to proactively mitigate such severe performance interference has received comparatively little attention. In this paper, we propose iGniter, an interference-aware GPU resource provisioning framework for cost-efficiently achieving predictable DNN inference in the cloud. iGniter comprises two key components: (1) a lightweight DNN inference performance model, which leverages practically accessible system and workload metrics to capture the performance interference; and (2) a cost-efficient GPU resource provisioning strategy that jointly optimizes GPU resource allocation and adaptive batching based on our inference performance model, with the aim of achieving predictable performance for DNN inference workloads. We implement a prototype of iGniter based on the NVIDIA Triton inference server hosted on EC2 GPU instances. Extensive prototype experiments on four representative DNN models and datasets demonstrate that iGniter can guarantee the performance SLOs of DNN inference workloads with practically acceptable runtime overhead, while reducing monetary cost by up to 25% compared with state-of-the-art GPU resource provisioning strategies.
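To make the idea of jointly choosing a GPU share and a batch size more concrete, the following is a minimal sketch under loudly stated assumptions. It is not iGniter's performance model or provisioning algorithm, neither of which is specified in the abstract. The latency formula, the interference factor `alpha`, the 10% share granularity, and all names (`Workload`, `predicted_latency_ms`, `provision`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Workload:
    slo_ms: float       # latency SLO for one batch, in milliseconds
    rate_rps: float     # request arrival rate the workload must sustain
    base_ms: float      # toy parameter: fixed per-batch cost on a whole GPU
    per_item_ms: float  # toy parameter: extra cost per request in a batch


def predicted_latency_ms(w: Workload, batch: int, share: float,
                         co_located: int, alpha: float = 0.1) -> float:
    """Toy latency model (an assumption, not the paper's model).

    Latency scales inversely with the GPU share and is inflated by a
    linear interference factor per co-located workload.
    """
    solo = (w.base_ms + w.per_item_ms * batch) / share
    return solo * (1.0 + alpha * co_located)


def provision(w: Workload, co_located: int,
              max_batch: int = 32) -> Optional[Tuple[float, int]]:
    """Return the smallest (share, batch) meeting the SLO and the rate.

    Shares are tried in 10% steps from 10% to 100%; within a share, the
    smallest feasible batch size wins. Returns None if nothing fits.
    """
    for pct in range(10, 101, 10):
        share = pct / 100
        for batch in range(1, max_batch + 1):
            latency = predicted_latency_ms(w, batch, share, co_located)
            throughput_rps = batch / latency * 1000.0
            if latency <= w.slo_ms and throughput_rps >= w.rate_rps:
                return share, batch
    return None


if __name__ == "__main__":
    w = Workload(slo_ms=30, rate_rps=390, base_ms=2, per_item_ms=0.5)
    print(provision(w, co_located=0))  # (0.3, 8)
    print(provision(w, co_located=2))  # (0.4, 6)
```

In this toy setting, accounting for two co-located workloads up front raises the provisioned share from 30% to 40% of the GPU, which is the proactive behavior the abstract contrasts with reactive scaling. A real system would replace the toy formula with a model fitted to measured system and workload metrics.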