ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization of Ephemeral Cloud Resources

Cloud data center capacity is over-provisioned to handle demand peaks and hardware failures, which leads to low resource utilization. One way to improve resource utilization, and thus reduce the total cost of ownership, is to offer unused resources (referred to as ephemeral resources) at a lower price. However, resold resources must still meet customers' expectations in terms of Quality of Service. The goal is therefore to maximize the amount of reclaimed resources while avoiding Service Level Agreement (SLA) penalties. To achieve that, cloud providers have to estimate their future utilization in order to provide availability guarantees. The prediction should include a safety margin of resources so that the provider can react to unpredictable workloads. The challenge is to find the safety margin that provides the best trade-off between the amount of resources to reclaim and the risk of SLA violations. Most state-of-the-art solutions use a fixed safety margin for all types of metrics (e.g., CPU, RAM). However, a single fixed margin does not account for workload variations over time, which may lead to SLA violations and/or poor utilization. To tackle these challenges, we propose ReLeaSER, a Reinforcement Learning strategy for optimizing the utilization of ephemeral resources in the cloud. ReLeaSER dynamically tunes the safety margin at the host level for each resource metric. The strategy learns from past prediction errors (those that caused SLA violations). Our solution reduces SLA violation penalties by 2.7x on average and by up to 3.4x. It also improves cloud providers' (CPs') potential savings by 27.6% on average and by up to 43.6%.
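
The idea of tuning a safety margin per host and per metric can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the learning formulation, so the code below substitutes a simple epsilon-greedy bandit, and the candidate margins, the reward shape, the `penalty_weight` value, and all names (`MARGINS`, `reclaimable`, `reward`, `MarginTuner`, `simulate`) are assumptions made purely for illustration.

```python
import random

# Hypothetical candidate safety margins, as fractions of host capacity.
MARGINS = [0.05, 0.10, 0.20, 0.30]


def reclaimable(capacity, predicted_use, margin):
    """Amount offered as ephemeral: capacity minus prediction minus margin."""
    return max(0.0, capacity - predicted_use - margin * capacity)


def reward(capacity, predicted_use, actual_use, margin, penalty_weight=5.0):
    """Assumed reward: usable reclaimed resources minus a weighted violation."""
    offered = reclaimable(capacity, predicted_use, margin)
    free = max(0.0, capacity - actual_use)
    # Offered resources that turned out not to be free (an SLA violation).
    violation = max(0.0, offered - free)
    return (offered - violation) - penalty_weight * violation


class MarginTuner:
    """Epsilon-greedy bandit with one value estimate per candidate margin."""

    def __init__(self, epsilon=0.1, step=0.2, seed=0):
        self.q = [0.0] * len(MARGINS)
        self.epsilon = epsilon
        self.step = step
        self.rng = random.Random(seed)

    def choose(self):
        # Explore with probability epsilon, otherwise pick the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(MARGINS))
        return max(range(len(MARGINS)), key=lambda i: self.q[i])

    def update(self, action, r):
        # Move the estimate for the chosen margin toward the observed reward.
        self.q[action] += self.step * (r - self.q[action])


def simulate(tuner, capacity=100.0, steps=500, seed=1):
    """Run the tuner on synthetic predictions with Gaussian prediction error."""
    rng = random.Random(seed)
    for _ in range(steps):
        predicted = rng.uniform(30, 60)
        actual = predicted + rng.gauss(0, 8)  # synthetic prediction error
        action = tuner.choose()
        tuner.update(action, reward(capacity, predicted, actual, MARGINS[action]))
    return MARGINS[max(range(len(MARGINS)), key=lambda i: tuner.q[i])]


# One independent tuner per (host, metric) pair, mirroring host-level,
# per-metric tuning; the host and metric names are placeholders.
tuners = {("host-1", "cpu"): MarginTuner(), ("host-1", "ram"): MarginTuner()}
```

Calling `simulate(tuners[("host-1", "cpu")])` returns the margin with the highest estimated value after the synthetic run. A larger `penalty_weight` should push the tuner toward wider margins, which is the trade-off between reclaimed resources and SLA violations described above.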
