This paper proposes a model for specifying data-flow-based parallel data processing programs that is agnostic of the target Big Data processing framework. The paper focuses on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by data flow Big Data processing frameworks. The proposed model relies on monoid algebra and Petri nets to abstract Big Data processing programs at two levels: a high level representing the program data flow and a lower level representing data transformation operations (e.g., filtering, aggregation, join). We extend the model for data processing programs proposed in [1] to support iterative programs. A general specification of iterative data processing programs implemented by data-flow-based parallel programming models is essential given the democratization of iterative and greedy Big Data analytics algorithms. Indeed, these algorithms call for revisiting parallel programming models to express iterations. The paper gives a comparative analysis of the iteration strategies proposed by Apache Spark, DryadLINQ, Apache Beam, and Apache Flink, and discusses how the model generalizes these strategies.
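To make the monoid-algebra view concrete, the following sketch (illustrative only, not taken from the paper; all function names are hypothetical) shows how typical data-flow operations such as filtering and aggregation can be expressed from a single `flatmap` primitive together with an associative merge operation, which is the style of abstraction the lower level of the model captures.

```python
# Illustrative sketch: data-flow operations in a monoid-algebra style.
# `flatmap` plays the role of a monoid homomorphism on collections, and
# aggregation is driven by an associative merge with an identity (zero).

def flatmap(f, xs):
    """Concatenate f(x) for each x: a homomorphism on the list monoid."""
    out = []
    for x in xs:
        out.extend(f(x))
    return out

def filter_op(pred, xs):
    # Filtering expressed via flatmap: keep x iff the predicate holds.
    return flatmap(lambda x: [x] if pred(x) else [], xs)

def group_by(key, merge, zero, xs):
    # Aggregation: fold each key's values with the monoid (zero, merge).
    acc = {}
    for x in xs:
        k = key(x)
        acc[k] = merge(acc.get(k, zero), x)
    return acc

data = [1, 2, 3, 4, 5]
evens = filter_op(lambda x: x % 2 == 0, data)                   # [2, 4]
sums = group_by(lambda x: x % 2, lambda a, b: a + b, 0, data)   # {1: 9, 0: 6}
```

Because `merge` is associative with identity `zero`, such aggregations can be evaluated in parallel over data partitions and combined afterwards, which is why monoid algebra is a natural fit for abstracting frameworks like Spark or Flink.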