Optimizing resource allocation for analytical workloads is vital for reducing the cost of cloud data services. At the same time, it is very hard for users to allocate resources per query in serverless query processing systems, and they frequently misallocate by orders of magnitude. Unfortunately, prior work focused on predicting peak allocation while ignoring aggressive trade-offs between resource allocation and run-time. Additionally, these methods fail to predict allocations for queries that have not been observed in the past. In this paper, we tackle both of these problems. We introduce a system for optimal resource allocation that can predict performance under aggressive trade-offs, for both new and previously observed queries. We introduce the notion of a performance characteristic curve (PCC) as a parameterized representation that compactly captures the relationship between resources and performance. To tackle training data sparsity, we introduce a novel data augmentation technique that efficiently synthesizes the entire PCC from a single run of the query. Lastly, through an extensive experimental evaluation on SCOPE big data workloads at Microsoft, we demonstrate the advantages of a constrained loss function coupled with graph neural networks (GNNs) over traditional ML methods for capturing domain-specific behavior.
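As a rough illustration of the PCC idea, the sketch below (Python) fits a simple parametric curve to a few (resources, run-time) points and uses the fitted parameters to explore the resource/run-time trade-off. The Amdahl-style functional form, the sample points, and the 20% slack threshold are illustrative assumptions, not the parameterization or data used in the paper.

```python
# Minimal sketch of a parameterized performance characteristic curve (PCC).
# ASSUMPTION: the functional form runtime(r) = serial + parallel_work / r and
# all numbers below are fabricated for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def pcc(resources, serial, parallel_work):
    """Hypothetical PCC: predicted run-time as a function of allocated resources."""
    return serial + parallel_work / resources

# Illustrative (resource, run-time) observations, e.g. as they might be
# synthesized by data augmentation from a single profiled run.
resources = np.array([10, 20, 40, 80, 160], dtype=float)
runtimes = np.array([620, 330, 190, 120, 90], dtype=float)

# Fit the two-parameter representation; (serial, parallel_work) is the
# compact encoding of the whole curve.
(serial, parallel_work), _ = curve_fit(pcc, resources, runtimes, p0=[10.0, 1000.0])

# Use the fitted curve to trade resources for run-time: find the smallest
# allocation whose predicted run-time is within 20% of the best achievable.
best = pcc(resources.max(), serial, parallel_work)
candidates = np.arange(1, 201)
within_slack = candidates[pcc(candidates, serial, parallel_work) <= 1.2 * best]
print(f"serial={serial:.1f}s, parallel_work={parallel_work:.0f}, "
      f"smallest allocation within 20% of best: {within_slack.min()} resource units")
```

In this toy setting the fitted pair of parameters stands in for the PCC; the actual system predicts such curves with GNNs over query plans rather than fitting them per query.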