Data of the order of terabytes, petabytes, or beyond is known as Big Data. Such data cannot be processed using traditional database software, hence the need for Big Data Platforms. A Big Data Platform combines the capabilities and features of various big data applications and utilities into a single solution that helps to develop, deploy, and manage a big data environment. Hadoop and Spark are two open-source Big Data Platforms provided by Apache. Both platforms have many configuration parameters, which can have unforeseen effects on execution time, accuracy, and other outcomes. Tuning these parameters manually is tedious, so automatic tuning methods are needed. After studying and analyzing previous work on automating the tuning of these parameters, this paper proposes two algorithms that tune them automatically: Grid Search with Finer Tuning and Controlled Random Search. The performance indicator studied in this paper is execution time. Experimental results show a reduction in execution time of about 70% and 50% for Hadoop and 81.19% and 77.77% for Spark with Grid Search with Finer Tuning and Controlled Random Search, respectively.
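The abstract names Grid Search with Finer Tuning but does not spell out its steps, so the following is only a minimal sketch of one plausible reading: evaluate a coarse grid, then repeatedly re-grid a smaller window around the best point found so far. It is not the authors' implementation. The parameter names `num_tasks` and `memory_fraction`, their ranges, and the function `toy_execution_time` are hypothetical stand-ins; in a real setting the cost function would launch a Hadoop or Spark job and return its measured execution time.

```python
import itertools


def grid_points(low, high, steps):
    """Return `steps` evenly spaced values from low to high, inclusive."""
    width = (high - low) / (steps - 1)
    return [low + i * width for i in range(steps)]


def grid_search(cost, bounds, steps):
    """Evaluate `cost` on a full grid and return (best_params, best_cost)."""
    names = list(bounds)
    axes = [grid_points(bounds[n][0], bounds[n][1], steps) for n in names]
    best_params, best_cost = None, float("inf")
    for combo in itertools.product(*axes):
        params = dict(zip(names, combo))
        value = cost(params)
        if value < best_cost:
            best_params, best_cost = params, value
    return best_params, best_cost


def grid_search_with_refinement(cost, bounds, steps=5, rounds=3):
    """Coarse grid first, then finer grids centred on the best point so far."""
    if steps < 3:
        raise ValueError("steps must be at least 3 so the grid can be refined")
    best_params, best_cost = grid_search(cost, bounds, steps)
    for _ in range(rounds - 1):
        narrowed = {}
        for name, (low, high) in bounds.items():
            cell = (high - low) / (steps - 1)  # spacing of the current grid
            centre = best_params[name]
            # Keep one grid cell on each side of the best point, clipped
            # to the range that was just searched.
            narrowed[name] = (max(low, centre - cell), min(high, centre + cell))
        bounds = narrowed
        params, value = grid_search(cost, bounds, steps)
        if value < best_cost:
            best_params, best_cost = params, value
    return best_params, best_cost


def toy_execution_time(params):
    """Hypothetical stand-in for a measured job run time, in seconds."""
    return (
        10.0
        + (params["num_tasks"] - 37.0) ** 2
        + 100.0 * (params["memory_fraction"] - 0.62) ** 2
    )


if __name__ == "__main__":
    search_space = {"num_tasks": (1.0, 100.0), "memory_fraction": (0.0, 1.0)}
    best, seconds = grid_search_with_refinement(toy_execution_time, search_space)
    print(best, seconds)
```

Each round in this sketch costs `steps ** len(bounds)` evaluations, so with real jobs the grid size and the number of rounds would have to be chosen with the cost of a single run in mind.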