Joining two tables is a fundamental operation in querying databases. In this paper, we focus on the equi-join problem, where a pair of records from the two joined tables is part of the join results if their values in the join column(s) are equal. While the problem is tractable when the joined tables contain relatively few records, it becomes very challenging as the table sizes increase, especially if hot keys (join column values shared by a large number of records) exist in both joined tables. This paper, an extended version of [metwally-SIGMOD-2022], proposes Adaptive-Multistage-Join (AM-Join) for scalable and fast equi-joins in distributed shared-nothing architectures. AM-Join utilizes (a) Tree-Join, a proposed novel algorithm that scales well when the joined tables share hot keys, and (b) Broadcast-Join, the fastest known algorithm for joining keys that are hot in only one table. Unlike the state-of-the-art algorithms, AM-Join (a) holistically solves the join-skew problem by achieving load balancing throughout the join execution, and (b) supports all outer-join variants without record deduplication or custom table partitioning. For the fastest AM-Join outer-join performance, we propose the Index-Broadcast-Join (IB-Join) family of algorithms for Small-Large joins, where one table fits in memory and the other can be up to orders of magnitude larger. The outer-join variants of IB-Join improve on the state-of-the-art Small-Large outer-join algorithms. The proposed algorithms can be adopted in any shared-nothing architecture. We implemented a MapReduce version using Spark. Our evaluation shows that the proposed algorithms execute significantly faster and scale to more skewed and orders-of-magnitude bigger tables than the state-of-the-art algorithms.
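To make the terminology concrete, the following is a minimal single-machine sketch of an equi-join and of labeling join keys by hotness. It is not the AM-Join implementation: the function names `equi_join` and `classify_keys`, the `hot_threshold` parameter, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def equi_join(left, right):
    """Hash-based equi-join of two lists of (key, value) records."""
    # Index the right table by join key.
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    # Emit one result per pair of records whose join keys are equal.
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]

def classify_keys(left, right, hot_threshold):
    """Label each key present in both tables as hot in both, one, or neither.

    hot_threshold is a hypothetical cutoff: a key is treated as hot in a
    table if at least this many records in that table carry the key.
    """
    left_counts = Counter(key for key, _ in left)
    right_counts = Counter(key for key, _ in right)
    labels = {}
    for key in left_counts.keys() & right_counts.keys():
        hot_left = left_counts[key] >= hot_threshold
        hot_right = right_counts[key] >= hot_threshold
        if hot_left and hot_right:
            labels[key] = "hot-in-both"   # the case Tree-Join targets
        elif hot_left or hot_right:
            labels[key] = "hot-in-one"    # the case Broadcast-Join targets
        else:
            labels[key] = "cold"
    return labels

# Illustrative data only.
left = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("c", 5)]
right = [("a", 10), ("a", 20), ("a", 30), ("b", 40), ("b", 50), ("b", 60), ("d", 70)]

results = equi_join(left, right)
labels = classify_keys(left, right, hot_threshold=3)
```

In this toy data, key `a` contributes 3 x 3 = 9 result records while key `b` contributes 1 x 3 = 3, which illustrates why a key that is hot in both tables dominates the output size and, in a distributed setting, the load on whichever executor is assigned that key.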