Serving machine learning inference workloads in the cloud remains a challenging task at the production level. Configuring an inference workload to meet SLA requirements while minimizing infrastructure costs is highly complicated due to the complex interaction between batch configuration, resource configuration, and the variable arrival process. Serverless computing has emerged in recent years to automate most infrastructure management tasks. Workload batching has shown the potential to improve the response time and cost-effectiveness of machine learning serving workloads, but it is not yet supported out of the box by serverless computing platforms. Our experiments show that, for a variety of machine learning workloads, batching can substantially improve the system's efficiency by reducing the per-request processing overhead. In this work, we present MLProxy, an adaptive reverse proxy that supports efficient machine learning serving workloads on serverless computing systems. MLProxy performs adaptive batching to ensure SLA compliance while optimizing serverless costs. We performed rigorous experiments on Knative to demonstrate the effectiveness of MLProxy, showing that it can reduce the cost of serverless deployment by up to 92% while reducing SLA violations by up to 99%, results that generalize across state-of-the-art model serving frameworks.
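The core idea of SLA-aware adaptive batching can be illustrated with a minimal sketch: accumulate incoming requests into a batch, and flush either when the batch is full or when the oldest queued request would risk violating its SLA given an estimated model latency. The class name, parameters, and flush policy below are hypothetical illustrations, not MLProxy's actual algorithm or API.

```python
from collections import deque


class AdaptiveBatcher:
    """Toy sketch of SLA-aware adaptive batching (hypothetical interface;
    MLProxy's actual batching algorithm is not reproduced here)."""

    def __init__(self, sla_ms, est_latency_ms, max_batch=32):
        self.sla_ms = sla_ms                  # end-to-end latency target per request
        self.est_latency_ms = est_latency_ms  # estimated per-batch model latency
        self.max_batch = max_batch            # hard cap on batch size
        self.queue = deque()                  # (request, arrival_time_ms) pairs

    def enqueue(self, request, arrival_ms):
        self.queue.append((request, arrival_ms))

    def should_flush(self, now_ms):
        if not self.queue:
            return False
        if len(self.queue) >= self.max_batch:
            return True  # batch is full: dispatch immediately
        # Flush when the oldest request's remaining wait budget is exhausted,
        # i.e., further batching would risk an SLA violation.
        oldest_arrival = self.queue[0][1]
        wait_budget = self.sla_ms - self.est_latency_ms
        return (now_ms - oldest_arrival) >= wait_budget

    def flush(self):
        batch = [request for request, _ in self.queue]
        self.queue.clear()
        return batch
```

Larger batches amortize per-request overhead (lower cost), while the SLA-derived wait budget bounds how long any request can sit in the queue, which is the trade-off the abstract describes.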