As data volumes grow across applications, analyzing large amounts of data is becoming increasingly important. Big data processing frameworks such as Apache Hadoop, Apache AsterixDB, and Apache Spark have been built to meet this demand. A common objective of these traditional cluster-based frameworks is high performance, which usually means low end-to-end execution time or latency. The widespread adoption of data analytics has prompted calls to improve on these traditional approaches. In particular, there is demand for making the analytics process more interactive and adaptive, especially for long-running jobs. In addition, the importance of initial results in the iterative process of data wrangling has motivated a result-aware approach to big data analytics. This dissertation is motivated by these calls for improvement and by experience gained over the past few years working on the Texera project, a collaborative data analytics service being developed at UC Irvine. The dissertation consists of three main parts. The first part presents the design of the Amber engine, which serves as the backend data processing framework for the Texera service. The second part presents Reshape, an adaptive and result-aware skew-handling framework. Reshape uses fast control messages to implement iterative skew mitigation techniques for a wide variety of operators, and its mitigation techniques are also analyzed in terms of their effects on the results shown to the user. The last part presents Maestro, a result-aware workflow scheduling framework, and discusses how to schedule a workflow for execution on computing clusters while making result-aware decisions. Together, these contributions improve the data analytics process by bringing interactivity, adaptivity, and result-awareness to it.