Big data processing at production scale presents a highly complex environment for resource optimization (RO), a problem crucial to meeting the performance goals and budgetary constraints of analytical users. The RO problem is challenging because it involves a set of decisions (the partition count, the placement of parallel instances on machines, and the resource allocation to each instance), requires multi-objective optimization (MOO), and is compounded by the scale and complexity of big data systems, all while having to meet stringent time constraints for scheduling. This paper presents a MaxCompute-based integrated system that supports multi-objective resource optimization via fine-grained instance-level modeling and optimization. We propose a new architecture that breaks RO into a series of simpler problems, new fine-grained predictive models, and novel optimization methods that exploit these models to make effective instance-level recommendations in a hierarchical MOO framework. Evaluation using production workloads shows that our new RO system can reduce latency by 37-72% and cost by 43-78% at the same time, compared to the current optimizer and scheduler, while running in 0.02-0.23s.
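To make the multi-objective aspect of the abstract concrete, the following is a minimal sketch of choosing among candidate resource configurations by latency and cost. It is not the paper's hierarchical MOO method: the `Candidate` fields, the predicted latency and cost values, and the weighted-sum selection rule are all illustrative assumptions, standing in for whatever the fine-grained predictive models and optimizer actually produce.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """One hypothetical configuration with model-predicted objectives."""
    partitions: int    # assumed partition count
    cores: int         # assumed CPU cores per instance
    latency_s: float   # predicted latency (made-up value)
    cost: float        # predicted cost (made-up value, arbitrary units)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.latency_s <= b.latency_s and a.cost <= b.cost
    better = a.latency_s < b.latency_s or a.cost < b.cost
    return no_worse and better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def pick_weighted(front: List[Candidate], w_latency: float = 0.5) -> Candidate:
    """Pick one Pareto point by a weighted sum of min-max normalized objectives."""
    lat_min = min(c.latency_s for c in front)
    cost_min = min(c.cost for c in front)
    # Fall back to 1.0 when all candidates share a value, to avoid dividing by zero.
    lat_span = (max(c.latency_s for c in front) - lat_min) or 1.0
    cost_span = (max(c.cost for c in front) - cost_min) or 1.0

    def score(c: Candidate) -> float:
        lat = (c.latency_s - lat_min) / lat_span
        cost = (c.cost - cost_min) / cost_span
        return w_latency * lat + (1.0 - w_latency) * cost

    return min(front, key=score)


# Illustrative candidates only; none of these numbers come from the paper.
candidates = [
    Candidate(8, 2, 120.0, 10.0),
    Candidate(16, 2, 70.0, 14.0),
    Candidate(16, 4, 60.0, 25.0),
    Candidate(32, 4, 65.0, 30.0),  # slower and costlier than (16, 4)
    Candidate(8, 4, 125.0, 12.0),  # slower and costlier than (8, 2)
]

front = pareto_front(candidates)
choice = pick_weighted(front, w_latency=0.5)
```

Filtering to the Pareto front first means the final preference (here a single weight) only ever trades off between configurations that are not strictly worse than another; a production system would presumably replace both the brute-force filter and the weighted sum with something that fits the sub-second scheduling budget mentioned in the abstract.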