The use of machine learning (ML) inference across applications is growing rapidly. ML inference services engage with users directly, requiring fast and accurate responses. Moreover, these services face dynamic request workloads, requiring their computing resources to be adjusted over time. Failing to right-size computing resources results in either violations of latency service-level objectives (SLOs) or wasted resources. Adapting to dynamic workloads while accounting for all three pillars of accuracy, latency, and resource cost is challenging. In response to these challenges, we propose InfAdapter, which proactively selects a set of ML model variants, along with their resource allocations, to meet the latency SLO while maximizing an objective function composed of accuracy and cost. InfAdapter reduces SLO violations and cost by up to 65% and 33%, respectively, compared to a popular industry autoscaler (the Kubernetes Vertical Pod Autoscaler).
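To make the decision concrete, the selection problem can be sketched as a small optimization: given a predicted request rate, choose a subset of model variants and per-variant replica counts so that capacity covers the workload within the latency SLO, maximizing a weighted accuracy-minus-cost objective. The sketch below is a minimal brute-force illustration under assumptions not taken from the paper: the variant catalog, its accuracy/latency/cost/throughput numbers, the weights `ALPHA` and `BETA`, and the even traffic split are all hypothetical, and InfAdapter's actual formulation and solver differ.

```python
import math
from itertools import combinations

# Hypothetical variant catalog (names and numbers are illustrative only):
# (name, accuracy, p99 latency in ms, cost per replica, throughput per replica in req/s)
VARIANTS = [
    ("resnet18",  0.70, 20, 1.0, 120),
    ("resnet50",  0.76, 45, 2.0,  60),
    ("resnet152", 0.78, 90, 4.0,  30),
]

SLO_MS = 100             # latency SLO each served request must meet (assumed)
ALPHA, BETA = 1.0, 0.05  # assumed weights trading accuracy against cost

def plan(predicted_rps):
    """Return (subset, replica_counts, score) maximizing
    ALPHA * avg_accuracy - BETA * total_cost, subject to the latency SLO
    and sufficient serving capacity (brute force, for illustration)."""
    # Drop variants whose latency alone already violates the SLO.
    feasible = [v for v in VARIANTS if v[2] <= SLO_MS]
    best = None
    for k in range(1, len(feasible) + 1):
        for subset in combinations(feasible, k):
            # Even traffic split across the chosen variants, for simplicity.
            share = predicted_rps / len(subset)
            # Replicas needed per variant to absorb its share of traffic.
            replicas = [math.ceil(share / v[4]) for v in subset]
            cost = sum(r * v[3] for r, v in zip(replicas, subset))
            # Traffic-weighted accuracy reduces to the mean under an even split.
            accuracy = sum(v[1] for v in subset) / len(subset)
            score = ALPHA * accuracy - BETA * cost
            if best is None or score > best[2]:
                best = (subset, replicas, score)
    return best

subset, replicas, score = plan(predicted_rps=200)
for v, r in zip(subset, replicas):
    print(f"{v[0]}: {r} replica(s)")
print(f"objective score: {score:.3f}")
```

Even this toy version shows the key trade-off: serving part of the traffic with a cheaper, less accurate variant lowers cost, while keeping a more accurate variant in the mix raises the accuracy term, and the SLO filter bounds which variants may participate at all.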