A High Throughput Parallel Hash Table on FPGA using XOR-based Memory

The hash table is a fundamental data structure for fast search and retrieval of data, and a key component in complex graph analytics and AI/ML applications. State-of-the-art parallel hash table implementations either make simplifying assumptions, such as supporting only a subset of hash table operations, or employ optimizations that make performance highly data dependent, so that in the worst case it can degrade to that of a sequential implementation. In contrast, in this work we develop a dynamic hash table that supports all hash table queries (search, insert, delete, and update) while sustaining p parallel queries (p > 1) per clock cycle via p processing engines (PEs), even in the worst case; that is, the performance is data agnostic. We achieve this by implementing novel XOR-based multi-ported block memories on FPGAs. Additionally, we develop a technique to optimize the memory requirement of the hash table when the ratio of search queries to insert/update/delete queries is known beforehand. We implement our design on state-of-the-art FPGA devices. Our design scales to 16 PEs and supports throughput of up to 5926 MOPS (million operations per second). It matches the throughput of the state-of-the-art hash table design, FASTHash, which supports only search and insert operations. Compared with the best FPGA design that supports the same set of operations, our hash table achieves up to 12.3x speedup.
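
To make the idea of an XOR-based multi-ported memory concrete, the following is a minimal software sketch of the general XOR encoding technique, not the design from this work, whose details the abstract does not give. It assumes one bank per write port, that a stored word is recovered by XORing the same address across all banks, and that two ports never write the same address in one cycle. The class name `XorMultiPortMemory` and the method `cycle` are hypothetical names chosen for illustration.

```python
from functools import reduce
from operator import xor


class XorMultiPortMemory:
    """Behavioral model of an XOR-encoded memory (illustrative assumption only).

    Each write port owns one bank. The logical word at an address is the XOR
    of that address across all banks, so a port can update a word by writing
    only to its own bank.
    """

    def __init__(self, depth, num_write_ports):
        self.depth = depth
        self.banks = [[0] * depth for _ in range(num_write_ports)]

    def read(self, addr):
        # Decode: XOR the word at addr across every bank.
        return reduce(xor, (bank[addr] for bank in self.banks), 0)

    def cycle(self, writes, reads=()):
        """Model one clock cycle.

        writes: dict mapping write port index -> (addr, value)
        reads:  iterable of addresses; reads observe the pre-write state
        """
        addrs = [addr for addr, _ in writes.values()]
        if len(set(addrs)) != len(addrs):
            # Assumed restriction: same-address writes in one cycle are not modeled.
            raise ValueError("two ports wrote the same address in one cycle")

        read_results = [self.read(addr) for addr in reads]

        # Encode every write against the old bank contents, then commit all
        # writes together so that ports do not observe each other mid-cycle.
        encoded = {}
        for port, (addr, value) in writes.items():
            others = reduce(
                xor,
                (bank[addr] for i, bank in enumerate(self.banks) if i != port),
                0,
            )
            encoded[port] = (addr, value ^ others)
        for port, (addr, word) in encoded.items():
            self.banks[port][addr] = word

        return read_results


if __name__ == "__main__":
    mem = XorMultiPortMemory(depth=8, num_write_ports=2)
    mem.cycle({0: (3, 0xAB), 1: (5, 0xCD)})  # two writes in the same cycle
    print(hex(mem.read(3)), hex(mem.read(5)))
```

In hardware, each bank would additionally be replicated once per read port, because a simple dual-port block RAM offers only one read and one write per cycle; the sketch omits this replication since it does not change the stored values.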
