As one of the most useful online processing techniques, the theta-join operation is used by many applications to uncover relationships between data streams in a variety of scenarios. Consequently, sustained research effort has gone into optimizing its performance in distributed environments, where optimization typically means reducing the number of Cartesian products as much as possible. In this article, we design and implement a novel fast theta-join algorithm, called Prefap, which builds on the state-of-the-art FastThetaJoin algorithm and adds two distinct techniques, prefiltering and amalgamated partitioning, to improve the efficiency of the theta-join operation. First, we develop a prefiltering strategy that is applied before the data streams are partitioned; it reduces the amount of data involved and enables a more fine-grained partitioning. Second, to avoid partitioning the data streams in a coarse-grained, isolated manner and to improve the quality of partition-level filtering, we introduce an amalgamated partitioning mechanism that amalgamates the partitioning boundaries of the two data streams to support fine-grained partitioning. By integrating these two techniques into the existing FastThetaJoin algorithm, we design and implement a new framework that reduces the number of Cartesian products and achieves higher theta-join efficiency. We evaluate Prefap against existing algorithms, FastThetaJoin in particular, on both synthetic and real data streams, from two-way to multiway theta-joins, to demonstrate its superiority.
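The two ideas named above can be illustrated with a small sketch. This is not the Prefap implementation. It is a toy, single-machine illustration under stated assumptions: the join predicate is `r < s` on numeric keys, prefiltering is modelled as dropping tuples that cannot match any partner based on the other stream's minimum or maximum, and amalgamated partitioning is modelled as merging the quantile boundaries of both streams into one shared boundary list. All function names are hypothetical.

```python
from bisect import bisect_right
from itertools import product


def prefilter(r, s):
    """Assumed prefilter: drop tuples that cannot satisfy r < s with any partner."""
    if not r or not s:
        return [], []
    r_min, s_max = min(r), max(s)
    return [x for x in r if x < s_max], [y for y in s if y > r_min]


def quantile_bounds(values, k):
    """Return k-1 interior boundaries that split the values into roughly equal parts."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // k] for i in range(1, k)]


def amalgamated_bounds(r, s, k):
    """Assumed amalgamation: the sorted union of both streams' own boundaries."""
    return sorted(set(quantile_bounds(r, k)) | set(quantile_bounds(s, k)))


def partition(values, bounds):
    """Partition p holds values v with bounds[p-1] <= v < bounds[p]."""
    parts = [[] for _ in range(len(bounds) + 1)]
    for v in values:
        parts[bisect_right(bounds, v)].append(v)
    return parts


def theta_join_less_than(r, s, k=2):
    """Return (matching pairs, number of tuple-level comparisons performed)."""
    r, s = prefilter(r, s)
    if not r or not s:
        return [], 0
    bounds = amalgamated_bounds(r, s, k)
    r_parts, s_parts = partition(r, bounds), partition(s, bounds)
    out, compared = [], 0
    for i, rp in enumerate(r_parts):
        for j, sp in enumerate(s_parts):
            if i > j:
                continue  # every r here exceeds every s here: skip the pair
            if i < j:
                out.extend(product(rp, sp))  # every r < every s: emit all pairs
            else:
                compared += len(rp) * len(sp)  # same range: compare tuple by tuple
                out.extend((x, y) for x in rp for y in sp if x < y)
    return out, compared
```

Because both streams share one boundary list, a pair of partitions can be classified as "all match", "none match", or "must compare" from the partition indices alone, so tuple-level comparisons are only needed where the two ranges coincide.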