Distributed dataflow systems enable data-parallel processing of large datasets on clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. Yet selecting appropriate cloud resources for dataflow jobs, ones that lead neither to bottlenecks nor to low resource utilization, is often challenging, even for expert users such as data engineers. We present C3O, a collaborative system for optimizing data processing cluster configurations in public clouds based on shared historical runtime data. The shared data is used to predict the runtimes of data processing jobs on different possible cluster configurations with specialized regression models. These models take the diverse execution contexts of different users into account and exhibit mean absolute errors below 3% in our experimental evaluation with 930 unique Spark jobs.
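The general idea of predicting runtimes from shared historical data and then choosing a cluster configuration can be illustrated with a minimal sketch. This is not C3O's actual model: the inverse-scaling form runtime = a + b / nodes, the `Run` record, the function names, the price, and the sample runtimes are all assumptions made purely for illustration, and the sketch ignores the execution-context features that the specialized models take into account.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One historical execution (hypothetical record layout)."""
    nodes: int        # scale-out of the cluster
    runtime_s: float  # observed job runtime in seconds


def fit_inverse_scaling(runs):
    """Least-squares fit of the assumed model runtime = a + b / nodes."""
    xs = [1.0 / r.nodes for r in runs]
    ys = [r.runtime_s for r in runs]
    n = len(runs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # parallelizable share of the work
    a = mean_y - b * mean_x  # fixed overhead that does not shrink with scale-out
    return a, b


def predict(model, nodes):
    """Predicted runtime in seconds for a given scale-out."""
    a, b = model
    return a + b / nodes


def cheapest_config(model, candidates, price_per_node_hour, max_runtime_s):
    """Return (nodes, cost, runtime) of the cheapest candidate meeting the
    runtime target, or None if no candidate meets it."""
    best = None
    for nodes in candidates:
        runtime = predict(model, nodes)
        if runtime > max_runtime_s:
            continue  # too slow for the user's runtime target
        cost = nodes * price_per_node_hour * runtime / 3600.0
        if best is None or cost < best[1]:
            best = (nodes, cost, runtime)
    return best


# Made-up runtime data standing in for runs shared by other users.
shared_runs = [Run(2, 620.0), Run(4, 320.0), Run(8, 170.0), Run(16, 95.0)]
model = fit_inverse_scaling(shared_runs)
choice = cheapest_config(model, [2, 4, 8, 12, 16],
                         price_per_node_hour=0.5, max_runtime_s=200.0)
```

Under this assumed model the cost grows with the number of nodes, so the search settles on the smallest scale-out that still meets the runtime target. A model that also accounts for differing execution contexts, as described above, would take further input features such as dataset size or job parameters.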