Scalable Hash Table for NUMA Systems

Hash tables are used in many applications, including database operations, DNA sequencing, and string searching. As a result, there are many parallel hash tables targeting multicore, distributed, and accelerator-based systems. In this work we present a multi-GPU hash table implementation that processes keys at a throughput comparable to that of distributed hash tables. Distributed CPU hash tables have received significantly more attention than GPU-based hash tables. We show that a single node with multiple GPUs offers roughly the same performance as a CPU-based cluster with 500 to 1,000 cores. The key component of our algorithm is the use of multiple sparse-graph data structures and binning techniques to build the hash table. As has been shown for each individually, these components can be written with massive parallelism that is amenable to GPU acceleration. Since we focus on a single node, we also leverage communication primitives that are typically prohibitively expensive in distributed environments. We show that our new multi-GPU algorithm shares many features with the single-GPU algorithm, so it manages collisions efficiently and can deal with a large number of duplicate keys. We evaluate our algorithm on two multi-GPU compute nodes: 1) an NVIDIA DGX-2 server with 16 GPUs and 2) an IBM POWER9 node with 6 NVIDIA GPUs. With 32-bit keys, our implementation processes 8 billion keys per second, which is comparable to some CPU-based clusters with 500 to 1,000 cores and 4x faster than prior single-GPU implementations.
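
The abstract does not give implementation details, so the following is only a minimal sequential sketch of the two ideas it names: a sparse-graph (CSR-style) layout built with a counting pass and a prefix sum, and binning of keys so that each device owns a contiguous range of buckets. The hash function and the names `bucket_of`, `build_csr_table`, `bin_keys`, and `count_key` are assumptions made for illustration; this is not the authors' GPU code.

```python
from itertools import accumulate

def bucket_of(key, num_buckets):
    # Hypothetical hash: multiplicative hashing on 32-bit keys (an assumption).
    return ((key * 2654435761) & 0xFFFFFFFF) % num_buckets

def build_csr_table(keys, num_buckets):
    """CSR-style table: values[offsets[b]:offsets[b + 1]] holds bucket b."""
    counts = [0] * num_buckets
    for k in keys:                            # pass 1: count keys per bucket
        counts[bucket_of(k, num_buckets)] += 1
    offsets = [0] + list(accumulate(counts))  # prefix sum gives bucket starts
    cursor = offsets[:-1].copy()              # next free slot in each bucket
    values = [None] * len(keys)
    for k in keys:                            # pass 2: scatter keys into place
        b = bucket_of(k, num_buckets)
        values[cursor[b]] = k
        cursor[b] += 1
    return offsets, values

def bin_keys(keys, num_buckets, num_devices):
    """Partition keys so each device owns a contiguous range of buckets."""
    per_dev = -(-num_buckets // num_devices)  # ceiling division
    bins = [[] for _ in range(num_devices)]
    for k in keys:
        bins[bucket_of(k, num_buckets) // per_dev].append(k)
    return bins

def count_key(offsets, values, key):
    """Count occurrences of key by scanning only its bucket."""
    b = bucket_of(key, len(offsets) - 1)
    return sum(1 for v in values[offsets[b]:offsets[b + 1]] if v == key)

keys = [5, 17, 5, 42, 99, 5]                  # duplicates are stored side by side
offsets, values = build_csr_table(keys, num_buckets=4)
bins = bin_keys(keys, num_buckets=4, num_devices=2)
```

In this layout duplicates simply occupy adjacent slots of the same bucket, which is one plausible reading of why the approach tolerates many duplicate keys. In a multi-GPU setting, each entry of `bins` would be sent to its device, which would then build its own local table; both passes and the prefix sum are data-parallel operations.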
