Frequent itemset mining (FIM) is a computationally and data-intensive task. Therefore, parallel and distributed FIM algorithms have been designed to process large volumes of data in reduced time. Recently, a number of FIM algorithms have been developed on Hadoop MapReduce, a distributed framework for big data processing. However, due to heavy disk I/O, MapReduce has proven inefficient for highly iterative FIM algorithms. Spark, a more efficient distributed data processing framework, addresses this with in-memory computation and the resilient distributed dataset (RDD) abstraction, which better support iterative algorithms. Apriori- and FP-Growth-based FIM algorithms have been designed on the Spark RDD framework, but an Eclat-based algorithm has not yet been explored. In this paper, RDD-Eclat, a parallel Eclat algorithm on the Spark RDD framework, is proposed along with five variants. The proposed algorithms are evaluated on various benchmark datasets, and the experimental results show that RDD-Eclat outperforms Spark-based Apriori by many times. The results also demonstrate the scalability of the proposed algorithms as the number of cores and the dataset size increase.
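To make the contrast with Apriori concrete, the core idea behind Eclat can be sketched as follows: the dataset is converted to a vertical layout mapping each item to its tidset (the set of transaction IDs containing it), and larger frequent itemsets are found by intersecting tidsets rather than rescanning the data. This is a minimal single-machine sketch of that idea only, not the paper's RDD-Eclat; the function name and structure are illustrative assumptions.

```python
from itertools import combinations

def eclat(transactions, min_support):
    # Hypothetical helper sketching plain Eclat, not the paper's RDD-Eclat.
    # Build the vertical layout: item -> set of transaction IDs (tidset).
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    # Frequent 1-itemsets: those whose tidset meets the support threshold.
    frequent = {frozenset([i]): t for i, t in tidsets.items()
                if len(t) >= min_support}
    result = dict(frequent)
    # Grow itemsets level by level; the support of a candidate is the
    # size of the intersection of its generators' tidsets.
    while frequent:
        next_level = {}
        for (a, ta), (b, tb) in combinations(frequent.items(), 2):
            union = a | b
            if len(union) == len(a) + 1 and union not in next_level:
                t = ta & tb
                if len(t) >= min_support:
                    next_level[union] = t
        result.update(next_level)
        frequent = next_level
    return {itemset: len(t) for itemset, t in result.items()}
```

Because support counting reduces to set intersections on the vertical layout, each branch of the search can proceed independently, which is what makes the approach attractive for partitioning across Spark RDDs.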