Problem Definition: Allocating sufficient capacity to cloud services is a challenging task, especially when demand is time-varying and heterogeneous, arrives in batches, and requires multiple types of resources for processing. In this setting, providers decide whether to reserve portions of their capacity for individual job classes or to offer it in a flexible manner.

Methodology/Results: In collaboration with Huawei Cloud, a worldwide provider of cloud services, we propose a heuristic policy that allocates multiple types of resources to jobs while satisfying their pre-specified service level agreements (SLAs). We model the system as a multi-class queueing network with parallel processing and multiple types of resources, where arrivals (i.e., virtual machines and containers) follow time-varying patterns and require at least one unit of each resource for processing. While virtual machines leave if they are not served immediately, containers can join a queue. We introduce a diffusion approximation of the offered load of such a system and assess its fidelity against the observed data. We then develop a heuristic approach that leverages this approximation to determine capacity levels that satisfy probabilistic SLAs in the system with fully flexible servers.

Managerial Implications: Using a data set of cloud computing requests over a representative 8-day period from Huawei Cloud, we show that our heuristic policy results in a 20% capacity reduction and better service quality compared to a benchmark that reserves resources. In addition, we show that the system utilization induced by our policy is superior to that of the benchmark, i.e., it implies less idling of resources in most instances. Thus, our approach enables cloud operators to both reduce costs and achieve better performance.
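The abstract does not spell out the heuristic itself, so the following is only a minimal sketch of the general idea of sizing capacity from a Gaussian approximation of offered load. It assumes, purely for illustration, that the offered load of each job class in a given period is approximately normal with a known mean and variance, that classes are independent, and that the SLA is a bound on the probability that load exceeds capacity. The function names `capacity_for_sla` and `reserved_vs_flexible` are hypothetical and do not come from the paper.

```python
import math
from statistics import NormalDist


def capacity_for_sla(mean_load: float, var_load: float, violation_prob: float) -> int:
    """Smallest integer capacity c such that P(load > c) <= violation_prob,
    under the ASSUMPTION that load ~ Normal(mean_load, var_load)."""
    # Standard normal quantile at the required service level.
    z = NormalDist().inv_cdf(1.0 - violation_prob)
    return max(0, math.ceil(mean_load + z * math.sqrt(var_load)))


def reserved_vs_flexible(class_means, class_vars, violation_prob):
    """Compare per-class reserved capacity with one pooled (fully flexible) capacity.

    ASSUMPTION: classes are independent, so pooled variance is the sum of variances.
    """
    reserved = sum(
        capacity_for_sla(m, v, violation_prob)
        for m, v in zip(class_means, class_vars)
    )
    pooled = capacity_for_sla(sum(class_means), sum(class_vars), violation_prob)
    return reserved, pooled


if __name__ == "__main__":
    # Hypothetical offered-load moments for two job classes in one time period.
    reserved, pooled = reserved_vs_flexible([100.0, 50.0], [100.0, 50.0], 0.01)
    print(f"reserved capacity: {reserved}, flexible capacity: {pooled}")
```

Under these assumptions, pooling tends to need less capacity than reserving per class because the safety margin scales with the square root of the variance. For time-varying demand, the same calculation could be repeated per period with period-specific moments; this is a simplification and not necessarily how the paper's policy handles multiple resource types or queueing containers.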