Reproducible and Portable Big Data Analytics in the Cloud

Cloud computing has become a major approach to help reproduce computational experiments. Yet two main difficulties remain in reproducing batch-based big data analytics (including descriptive and predictive analytics) in the cloud. The first is how to automate end-to-end scalable execution of analytics, including distributed environment provisioning, analytics pipeline description, parallel execution, and resource termination. The second is that an application developed for one cloud is difficult to reproduce in another cloud, a problem known as vendor lock-in. To tackle these problems, we leverage serverless computing and containerization techniques for automated scalable execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. We propose and develop an open-source toolkit that supports 1) fully automated end-to-end execution and reproduction via a single command, 2) automated data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproduction of existing executions in the same environment or a different environment. We conducted extensive experiments on both AWS and Azure using four big data analytics applications that run on virtual CPU/GPU clusters. The experiments show that our toolkit can achieve good execution performance, scalability, and efficient reproducibility for cloud-based big data analytics.
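To illustrate how the adapter design pattern can decouple an analytics run from a specific cloud, here is a minimal sketch. It is not the toolkit's code: `FakeAwsSdk` and `FakeAzureSdk` are made-up stand-ins for vendor SDKs, and `CloudAdapter`, `AwsAdapter`, `AzureAdapter`, and `execute` are hypothetical names chosen for this example. The only point is that the provision, run, terminate sequence is written once against a common interface.

```python
from abc import ABC, abstractmethod


# Hypothetical stand-ins for two vendor SDKs with incompatible method names.
class FakeAwsSdk:
    def create_cluster(self, instance_count):
        return f"aws-cluster-{instance_count}"

    def submit_job(self, cluster, script):
        return f"{cluster} ran {script}"

    def delete_cluster(self, cluster):
        return True


class FakeAzureSdk:
    def begin_create_pool(self, size):
        return f"azure-pool-{size}"

    def start_task(self, pool, command):
        return f"{pool} ran {command}"

    def remove_pool(self, pool):
        return True


class CloudAdapter(ABC):
    """Common interface the pipeline is written against (assumed for this sketch)."""

    @abstractmethod
    def provision(self, nodes): ...

    @abstractmethod
    def run(self, cluster_id, pipeline): ...

    @abstractmethod
    def terminate(self, cluster_id): ...


class AwsAdapter(CloudAdapter):
    def __init__(self, sdk):
        self.sdk = sdk

    def provision(self, nodes):
        return self.sdk.create_cluster(nodes)

    def run(self, cluster_id, pipeline):
        return self.sdk.submit_job(cluster_id, pipeline)

    def terminate(self, cluster_id):
        self.sdk.delete_cluster(cluster_id)


class AzureAdapter(CloudAdapter):
    def __init__(self, sdk):
        self.sdk = sdk

    def provision(self, nodes):
        return self.sdk.begin_create_pool(nodes)

    def run(self, cluster_id, pipeline):
        return self.sdk.start_task(cluster_id, pipeline)

    def terminate(self, cluster_id):
        self.sdk.remove_pool(cluster_id)


def execute(adapter, pipeline, nodes):
    """Provision, run, and always terminate, regardless of the target cloud."""
    cluster_id = adapter.provision(nodes)
    try:
        return adapter.run(cluster_id, pipeline)
    finally:
        adapter.terminate(cluster_id)


if __name__ == "__main__":
    # The same call reproduces the run on either (fake) cloud.
    for adapter in (AwsAdapter(FakeAwsSdk()), AzureAdapter(FakeAzureSdk())):
        print(execute(adapter, "wordcount.py", 4))
```

Because `execute` depends only on `CloudAdapter`, moving a run to another cloud in this sketch means swapping the adapter instance, while the pipeline description and the execution logic stay unchanged.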

