Graphics Processing Units (GPUs) have traditionally relied on the host CPU to initiate access to data storage. This approach is well-suited for GPU applications whose data access patterns are known in advance, enabling their datasets to be partitioned and processed in a pipelined fashion on the GPU. However, emerging applications such as graph and data analytics, recommender systems, and graph neural networks require fine-grained, data-dependent access to storage. CPU-initiated storage access is unsuitable for these applications due to high CPU-GPU synchronization overheads, I/O traffic amplification, and long CPU processing latencies. GPU-initiated storage access removes these overheads from the storage control path and thus can potentially support these applications at much higher speeds. However, there is no system architecture or software stack that enables efficient GPU-initiated storage access. This work presents a novel system architecture, BaM, that fills this gap. BaM features a fine-grained software cache to coalesce storage requests while minimizing I/O traffic amplification. This software cache communicates with the storage system via high-throughput queues that enable the massive number of concurrent threads in modern GPUs to make I/O requests at a high rate, fully utilizing the storage devices and the system interconnect. Experimental results show that BaM delivers 1.0x and 1.49x end-to-end speedups for the BFS and CC graph analytics benchmarks, respectively, while reducing hardware costs by up to 21.7x compared with accessing the graph data from host memory. Furthermore, BaM speeds up data-analytics workloads by 5.3x over CPU-initiated storage access on the same hardware.
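To make the access pattern concrete, the following is a minimal CUDA sketch of a data-dependent gather performed through a small GPU-resident software cache, in the spirit of the design described above. All names (`gpu_cache_t`, `cached_read`, `gather`) are illustrative assumptions, not the actual BaM API; the "storage" is simulated with a device buffer so the sketch compiles and runs, whereas a real GPU-initiated design would issue NVMe commands on GPU-resident submission queues on a cache miss. Concurrency control for the cache is also omitted for brevity.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Simplified device-side cache state (hypothetical; synchronization omitted).
struct gpu_cache_t {
    const int64_t* backing;   // stands in for storage-resident data
    int64_t*       lines;     // cached values, one per cache slot
    uint64_t*      tags;      // element index currently held in each slot
    uint64_t       num_lines;
};

// Read element `idx` through the cache. In a real GPU-initiated design, the
// miss path would enqueue an NVMe read on a GPU-resident submission queue and
// poll the completion queue; here it simply copies from `backing`.
__device__ int64_t cached_read(const gpu_cache_t& c, uint64_t idx) {
    uint64_t slot = idx % c.num_lines;
    if (c.tags[slot] != idx) {            // miss
        c.lines[slot] = c.backing[idx];   // placeholder for GPU-issued I/O
        c.tags[slot]  = idx;
    }
    return c.lines[slot];
}

// Data-dependent gather: the indices are only known at run time, so the CPU
// cannot pre-stage these reads; every thread initiates its own access.
__global__ void gather(gpu_cache_t c, const uint64_t* idx,
                       int64_t* out, uint64_t n) {
    uint64_t t = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
    if (t < n) out[t] = cached_read(c, idx[t]);
}

int main() {
    const uint64_t N = 1 << 20, LINES = 1 << 14, M = 1024;
    int64_t  *backing, *lines, *out;
    uint64_t *tags, *idx;
    cudaMallocManaged(&backing, N * sizeof(int64_t));
    cudaMallocManaged(&lines, LINES * sizeof(int64_t));
    cudaMallocManaged(&tags, LINES * sizeof(uint64_t));
    cudaMallocManaged(&idx, M * sizeof(uint64_t));
    cudaMallocManaged(&out, M * sizeof(int64_t));
    for (uint64_t i = 0; i < N; ++i) backing[i] = (int64_t)i * 2;
    for (uint64_t i = 0; i < M; ++i) idx[i] = (i * 7919) % N;  // irregular pattern
    cudaMemset(tags, 0xFF, LINES * sizeof(uint64_t));          // mark all slots invalid

    gpu_cache_t cache{backing, lines, tags, LINES};
    gather<<<(unsigned)((M + 255) / 256), 256>>>(cache, idx, out, M);
    cudaDeviceSynchronize();
    printf("out[0] = %lld\n", (long long)out[0]);
    return 0;
}
```

The point of the sketch is the control path: the accessing thread itself decides whether an I/O request is needed and would issue it without a round trip through the CPU, which is what removes the synchronization and traffic-amplification overheads the abstract describes.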