Resource sharing between multiple workloads has become a prominent practice among cloud service providers, motivated by demand for improved resource utilization and reduced cost of ownership. Effective resource sharing, however, remains an open challenge due to the adverse effects that resource contention can have on high-priority, user-facing workloads with strict Quality of Service (QoS) requirements. Although recent approaches have demonstrated promising results, they remain largely impractical in public cloud environments, since workloads are not known in advance and may run for only a brief period, prohibiting offline learning and significantly hindering online learning. In this paper, we propose RAPID, a novel framework for fast, fully online resource allocation policy learning in highly dynamic operating environments. RAPID leverages lightweight QoS predictions, enabled by domain-knowledge-inspired techniques for sample efficiency and bias reduction, to decouple control from conventional feedback sources and guide policy learning at a rate orders of magnitude faster than prior work. Evaluation on a real-world server platform with representative cloud workloads confirms that RAPID learns stable resource allocation policies in minutes, compared with hours for the prior state of the art, while improving QoS by 9.0x and increasing best-effort workload performance by 19-43%.
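To make the mechanism concrete, below is a minimal Python sketch of the kind of prediction-guided control loop the abstract describes: instead of waiting on slow end-to-end latency feedback, each interval a lightweight QoS predictor scores candidate allocations, and the controller picks the cheapest allocation predicted to still meet QoS, freeing the rest for best-effort workloads. Every name here (QoSPredictor, measure_qos, the knob ranges) is hypothetical, the synthetic QoS model is invented for illustration, and the simple online linear predictor stands in for RAPID's actual domain-knowledge-inspired techniques; this is an illustrative sketch, not the paper's implementation.

```python
# Illustrative sketch only: prediction-guided online resource allocation.
# All names and the synthetic QoS model are hypothetical, not RAPID's design.
import itertools
import random

CACHE_WAYS = range(1, 12)   # ways reserved for the latency-critical (LC) job
CORES = range(1, 9)         # cores reserved for the LC job
QOS_TARGET = 1.0            # normalized QoS: >= 1.0 means the SLO is met

def features(alloc):
    ways, cores = alloc
    return [1.0, ways, cores]

class QoSPredictor:
    """Online linear QoS model, refined with one SGD step per observation."""
    def __init__(self, n_features, lr=0.005):
        self.w = [0.0] * n_features
        self.lr = lr

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, qos):
        err = qos - self.predict(x)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]

def measure_qos(alloc):
    """Synthetic stand-in for a real QoS measurement (e.g. tail-latency slack)."""
    ways, cores = alloc
    return 0.15 * ways + 0.12 * cores + random.gauss(0, 0.02)

predictor = QoSPredictor(n_features=3)
alloc = (max(CACHE_WAYS), max(CORES))          # start with a conservative allocation

for step in range(200):
    qos = measure_qos(alloc)                   # feedback for this interval
    predictor.update(features(alloc), qos)     # refine the predictor online

    # Pick the cheapest allocation the predictor believes still meets QoS,
    # leaving the remaining ways/cores to best-effort workloads.
    safe = [a for a in itertools.product(CACHE_WAYS, CORES)
            if predictor.predict(features(a)) >= QOS_TARGET]
    if safe:
        alloc = min(safe, key=lambda a: a[0] + a[1])

print("converged allocation (ways, cores):", alloc)
```

In a real deployment the knobs would map to hardware and OS mechanisms (for example, cache partitioning and core pinning) and QoS would be a measured quantity such as tail-latency slack rather than a synthetic function.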