Data analysis, especially in GUI-based analytics systems, is highly exploratory. The user refines a workflow over many iterations before arriving at its final version. In such a setting, it is valuable if the initial results of the workflow are representative of the final answers, so that the user can refine the workflow without waiting for its execution to complete. Partitioning skew can cause the initial results produced during execution to be misleading. In this paper, we explore skew and its mitigation strategies from the perspective of the results shown to the user. We present a novel framework called Reshape that adaptively handles partitioning skew in pipelined execution. Reshape employs a two-phase approach that transfers load in a fine-tuned manner to mitigate skew iteratively during execution, which enables it to handle changes in the input-data distribution. Reshape can also adaptively adjust its skew-handling parameters, which reduces the technical burden on users. Reshape supports a variety of operators, such as HashJoin, Group-by, and Sort. We implemented Reshape on top of two big data engines, Amber and Flink, to demonstrate its generality and efficiency, and we report an experimental evaluation using real and synthetic datasets.
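To make the notion of partitioning skew concrete, the following is a minimal sketch in Python. It is not Reshape's algorithm: the helper names (`partition_load`, `skew_ratio`, `shift_load`), the modulo partitioner, and the "move half of the load difference" rule are all assumptions chosen for illustration only. It shows how one hot key overloads a single worker under key-based partitioning, and how moving part of that worker's load to a helper reduces the imbalance.

```python
def partition_load(keys, num_workers):
    """Count how many tuples each worker receives when integer keys
    are partitioned by `key % num_workers` (an assumed partitioner)."""
    load = {w: 0 for w in range(num_workers)}
    for k in keys:
        load[k % num_workers] += 1
    return load


def skew_ratio(load):
    """Ratio of the busiest worker's load to the mean load (1.0 = balanced)."""
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean


def shift_load(load, skewed, helper):
    """Hypothetical mitigation step: move half of the load difference
    from the skewed worker to a helper worker. Returns a new dict."""
    moved = (load[skewed] - load[helper]) // 2
    new_load = dict(load)
    new_load[skewed] -= moved
    new_load[helper] += moved
    return new_load


# Synthetic input: key 0 is "hot" (80 tuples), keys 1..20 appear once each.
keys = [0] * 80 + list(range(1, 21))

before = partition_load(keys, num_workers=4)
after = shift_load(before, skewed=0, helper=1)

print(before, skew_ratio(before))
print(after, skew_ratio(after))
```

In this toy input, worker 0 receives most of the tuples, so the other workers finish their small shares early and any results shown at that point under-represent the hot key. In a real pipelined engine, moving load for a stateful operator such as HashJoin or Group-by also requires handling operator state on the helper, which this sketch deliberately leaves out.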