A Join-Project operation is a join operation followed by a duplicate-eliminating projection. It is used in a wide variety of applications, including entity matching, set analytics, and graph analytics. Previous work proposes a hybrid design that uses the classical solution (i.e., join and deduplication) to process the sparse portion of the input data and MM (matrix multiplication) to process the dense portion. However, we observe three problems in the state-of-the-art solution: 1) the outputs of the sparse and dense portions overlap, requiring an extra deduplication step; 2) its table-to-matrix transformation makes an oversimplified assumption about the attribute values; and 3) there is a mismatch between the MM routines employed from BLAS packages and the characteristics of the Join-Project operation. In this paper, we propose DIM3, an optimized algorithm for the Join-Project operation. To address 1), we propose an intersection-free partition method that completely removes the final deduplication step. For 2), we develop an optimized design for mapping attribute values to natural numbers. For 3), we propose the DenseEC and SparseBMM algorithms, which exploit the structure of Join-Project for better efficiency. Moreover, we extend DIM3 to consider partial result caching and to support Join-op queries, including Join-Aggregate and MJP (Multi-way Joins with Projection). Experimental results on both real-world and synthetic data sets show that DIM3 outperforms previous Join-Project solutions by a factor of 2.3x-18x. Compared to RDBMSs, DIM3 achieves orders-of-magnitude speedups.
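To make the operation concrete, the following minimal Python sketch computes a Join-Project in the two ways the abstract contrasts: the classical join followed by deduplication, and a Boolean matrix product. It is an illustration only, not DIM3 or any of its components. The schema R(x, y) and S(y, z) with projection onto (x, z), the function names, and the use of Python integers as bitset rows are all assumptions made for this example.

```python
from collections import defaultdict


def join_project_classical(R, S):
    """Join R(x, y) with S(y, z) on y, then project onto (x, z) without duplicates.

    R and S are lists of 2-tuples (hypothetical schema for this sketch).
    """
    # Build a hash index on the join attribute y of S.
    s_index = defaultdict(set)
    for y, z in S:
        s_index[y].add(z)

    # Probe with R; the set performs the duplicate-eliminating projection.
    out = set()
    for x, y in R:
        for z in s_index.get(y, ()):
            out.add((x, z))
    return out


def join_project_boolean_mm(R, S):
    """Same result, computed as a Boolean matrix product.

    Row y of S is stored as a bitset over z; the result row for x is the
    bitwise OR of the S rows selected by the y values paired with x in R.
    """
    # Map z values to dense natural-number ids (bit positions).
    zs = sorted({z for _, z in S})
    z_id = {z: i for i, z in enumerate(zs)}

    s_rows = defaultdict(int)
    for y, z in S:
        s_rows[y] |= 1 << z_id[z]

    # OR-accumulate: each output bit is set at most once, so no
    # separate deduplication pass is needed.
    out_rows = defaultdict(int)
    for x, y in R:
        out_rows[x] |= s_rows.get(y, 0)

    # Decode the bitsets back into (x, z) pairs.
    out = set()
    for x, bits in out_rows.items():
        i = 0
        while bits:
            if bits & 1:
                out.add((x, zs[i]))
            bits >>= 1
            i += 1
    return out


# Small example: the pair (1, 10) is produced via both y='a' and y='b',
# but appears only once in the output.
R = [(1, "a"), (1, "b"), (2, "b")]
S = [("a", 10), ("b", 10), ("b", 20)]
result_classical = join_project_classical(R, S)
result_mm = join_project_boolean_mm(R, S)
```

In this sketch the classical version does work proportional to the size of the full join before the set discards duplicates, whereas the matrix version does one OR per tuple of R regardless of how many duplicates the join would generate. That difference is the usual motivation for routing dense portions of the input to matrix multiplication.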