The operational cost of a cloud computing platform is one of the most significant Quality of Service (QoS) criteria for schedulers, and controlling it is crucial to keeping up with growing computational demands. Several data-driven deep neural network (DNN)-based schedulers proposed in recent years outperform alternative approaches by providing scalable and effective resource management for dynamic workloads. However, state-of-the-art schedulers rely on advanced DNNs with high computational requirements, which implies high scheduling costs. In non-stationary settings, the most sophisticated scheduler is not always required, and relying on a low-cost scheduler may suffice to temporarily reduce operational costs. In this work, we propose MetaNet, a surrogate model that predicts the operational costs and scheduling overheads of a large number of DNN-based schedulers and chooses one on the fly to jointly optimize job scheduling and execution costs. This yields improvements in execution costs, energy usage and service level agreement violations of up to 11%, 43% and 13%, respectively, compared to state-of-the-art methods.
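The selection step described above can be illustrated with a minimal sketch. It assumes the surrogate returns, for each candidate scheduler, a predicted execution cost and a predicted scheduling overhead, and that the two are combined by a plain sum. The names `CostEstimate`, `select_scheduler` and `toy_surrogate`, the scheduler labels and the toy cost formulas are all hypothetical and chosen for illustration; none of them come from MetaNet itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class CostEstimate:
    """Predicted costs for one candidate scheduler (hypothetical structure)."""
    execution_cost: float   # predicted cost of executing jobs under this scheduler
    scheduling_cost: float  # predicted overhead of running the scheduler itself


# Hypothetical surrogate type: maps a system-state feature vector to
# per-scheduler cost estimates, keyed by scheduler name.
Surrogate = Callable[[Sequence[float]], Dict[str, CostEstimate]]


def select_scheduler(surrogate: Surrogate, state: Sequence[float]) -> str:
    """Return the scheduler with the lowest predicted total cost.

    Assumption: total cost is the unweighted sum of execution and
    scheduling costs; a real system may weight or normalize them.
    """
    estimates = surrogate(state)
    if not estimates:
        raise ValueError("surrogate returned no candidate schedulers")
    return min(
        estimates,
        key=lambda name: estimates[name].execution_cost
        + estimates[name].scheduling_cost,
    )


def toy_surrogate(state: Sequence[float]) -> Dict[str, CostEstimate]:
    """Stand-in for a learned model; state[0] is a load level in [0, 1]."""
    load = state[0]
    return {
        # Expensive to run, but keeps execution cost low under heavy load.
        "heavy_dnn": CostEstimate(1.0 + 2.0 * load, 1.5),
        # Cheap to run, but execution cost grows quickly with load.
        "light_heuristic": CostEstimate(1.0 + 6.0 * load, 0.1),
    }


if __name__ == "__main__":
    for load in (0.1, 0.9):
        print(load, select_scheduler(toy_surrogate, [load]))
```

With these toy numbers the low-cost scheduler is selected at light load and the DNN-based one at heavy load, which mirrors the motivation that the most sophisticated scheduler is not always worth its overhead.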