Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- resources that lead neither to bottlenecks nor to low utilization -- is often challenging, even for expert users such as data engineers. Moreover, existing automated approaches to resource selection assume that a job is recurring, either so that they can learn from previous runs or so that the cost of full test runs is warranted. This assumption often does not hold, however, since many jobs are unique. We therefore present Crispy, a method for optimizing data processing cluster configurations based on job profiling runs with small samples of the dataset on just a single machine. Crispy attempts to extrapolate the job's memory usage on the full dataset and then chooses a cluster configuration with enough total memory. In our evaluation on a dataset of 1031 Spark and Hadoop jobs, Crispy reduces job execution costs by 56% compared to the baseline, while spending on average less than ten minutes on profiling runs per job on a consumer-grade laptop.
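To make the core idea concrete, the following is a minimal sketch, not the authors' implementation: it fits a simple linear model to peak memory measurements from small-sample profiling runs, extrapolates to the full dataset, and picks the cheapest cluster configuration with enough total memory. All specifics here are illustrative assumptions: the linear scaling model, the `profile_points` measurements, the `MACHINE_TYPES` catalog, and the helper names.

```python
# Illustrative sketch (hypothetical, not Crispy's actual code):
# extrapolate peak memory usage from small-sample profiling runs,
# then pick the cheapest cluster configuration with enough memory.

# (sample_fraction, peak_memory_gb) pairs observed in profiling runs
profile_points = [(0.01, 0.9), (0.02, 1.7), (0.04, 3.2)]

def fit_linear(points):
    """Least-squares fit of memory = a * fraction + b (assumes linear scaling)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

a, b = fit_linear(profile_points)
full_memory_gb = a * 1.0 + b  # extrapolate to the full dataset (fraction = 1.0)

# Hypothetical machine catalog: (name, memory_gb, usd_per_hour)
MACHINE_TYPES = [("small", 8, 0.10), ("medium", 32, 0.38), ("large", 128, 1.50)]

def cheapest_config(required_gb, max_nodes=16):
    """Cheapest (hourly_cost, machine, node_count) whose total memory covers the estimate."""
    candidates = [
        (price * n, name, n)
        for name, mem, price in MACHINE_TYPES
        for n in range(1, max_nodes + 1)
        if mem * n >= required_gb
    ]
    return min(candidates)

cost, machine, nodes = cheapest_config(full_memory_gb)
print(f"Estimated need: {full_memory_gb:.1f} GB -> {nodes}x {machine} (${cost:.2f}/h)")
```

In practice one would hedge the single linear model with several candidate scaling models and a safety margin on the memory estimate; the sketch only shows the extrapolate-then-select structure described in the abstract.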