Reproducible and Portable Big Data Analytics in the Cloud

Cloud computing has become a major approach to enabling reproducible computational experiments because it supports on-demand provisioning of hardware and software resources. Yet two main difficulties remain in reproducing big data applications in the cloud. The first is how to automate the end-to-end execution of big data analytics in the cloud, including virtual distributed environment provisioning, network and security group setup, and the description and execution of the analytics pipeline. The second is that an application developed for one cloud, such as AWS or Azure, is difficult to reproduce in another cloud, which is known as the vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automatic, scalable big data application execution and reproducibility, and we use the adapter design pattern to enable application portability and reproducibility across different clouds. Based on this approach, we propose and develop an open-source toolkit that supports 1) on-demand distributed hardware and software environment provisioning, 2) automatic data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproducibility of existing executions in the same environment or a different environment. We conducted extensive experiments on both AWS and Azure using three big data analytics applications that run on a virtual CPU/GPU cluster. We benchmarked three main behaviors of our toolkit: i) the execution overhead ratio for reproducibility support, ii) the differences in execution time, budgetary cost, and cost-performance ratio when reproducing the same application on AWS and Azure, and iii) the differences between the scale-out and scale-up approaches for the same application on AWS and Azure.
