Distributed data processing frameworks (e.g., Hadoop, Spark, and Flink) are widely used to distribute data among the computing nodes of a cloud. Recent efforts have increasingly focused on evaluating the performance of these frameworks when hosted in private and public clouds. However, little research has evaluated their performance in a hybrid cloud, an emerging cloud model that integrates private and public clouds to combine the strengths of both. Therefore, in this paper, we evaluate the performance of Hadoop, Spark, and Flink in a hybrid cloud in terms of execution time, resource utilization, horizontal scalability, vertical scalability, and cost. For this study, our hybrid cloud consists of OpenStack (private cloud) and MS Azure (public cloud). We use both batch and iterative workloads for the evaluation. Our results show that in a hybrid cloud (i) execution time increases as the private cloud borrows more nodes from the public cloud, (ii) Flink outperforms Spark, which in turn outperforms Hadoop, in terms of execution time, (iii) Hadoop transfers the largest amount of data among the nodes during workload execution, while Spark transfers the least, (iv) all three frameworks scale better horizontally than vertically, and (v) Spark is the least expensive in terms of monetary cost for data processing, while Hadoop is the most expensive.