Big data applications and analytics are employed in many sectors for a variety of goals: improving customer satisfaction, predicting market behavior, or improving processes in public health. These applications consist of complex software stacks that are often run on cloud systems. Predicting execution times is important for estimating the cost of cloud services and for effectively managing the underlying resources at runtime. Machine Learning (ML), which provides black-box solutions for modeling the relationship between application performance and system configuration without requiring detailed knowledge of the system, has become a popular way of predicting the performance of big data applications. We investigate the cost-benefit trade-offs of using supervised ML models to predict the performance of applications running on Spark, one of today's most widely used frameworks for big data analysis. We compare our approach with \textit{Ernest} (an ML-based technique proposed in the literature by the Spark inventors) on a range of scenarios, application workloads, and cloud system configurations. Our experiments show that Ernest can accurately estimate the performance of very regular applications, but it fails when applications exhibit more irregular patterns and/or when extrapolating to larger data set sizes. Our results show that our models match or exceed Ernest's accuracy, in some cases reducing the prediction error from 126-187% to only 5-19%.
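For concreteness, the following is a minimal sketch of the kind of linear model Ernest fits, following its published formulation (a non-negative least-squares fit over features derived from the input scale $s$ and the machine count $m$); the training configurations and runtimes below are purely illustrative, not measurements from our experiments.

\begin{verbatim}
# Sketch of an Ernest-style performance model, assuming the published
# formulation: runtime is a non-negative linear combination of features
# of the input scale s (fraction of the full data set) and the number
# of machines m. Training data below is hypothetical.
import numpy as np
from scipy.optimize import nnls

def ernest_features(scale, machines):
    """Feature vector [1, s/m, log(m), m] of Ernest's linear model."""
    return np.array([1.0, scale / machines, np.log(machines), machines])

# Hypothetical profiling runs: (input scale, #machines) -> runtime (s).
configs = [(0.125, 2), (0.25, 2), (0.25, 4), (0.5, 4), (0.5, 8)]
runtimes = np.array([40.0, 70.0, 42.0, 75.0, 48.0])

X = np.vstack([ernest_features(s, m) for s, m in configs])
theta, _ = nnls(X, runtimes)  # non-negative least-squares fit

# Extrapolate to the full data set on a larger cluster.
predicted = ernest_features(1.0, 16) @ theta
print(f"predicted runtime at s=1.0, m=16: {predicted:.1f} s")
\end{verbatim}

Under this formulation, extrapolation to larger input sizes rests on the assumption that runtime grows linearly in $s/m$, which is precisely where such a model can break down for applications with more irregular scaling behavior.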