All-pairs shortest paths (APSP) is a fundamental algorithm used for routing, logistics, and network analysis, but the cubic time complexity and heavy data movement of the canonical Floyd-Warshall (FW) algorithm severely limits its scalability on conventional CPUs or GPUs. In this paper, we propose PIM-FW, a novel co-designed hardware architecture and dataflow that leverages processing in and near memory architecture designed to accelerate blocked FW algorithm on an HBM3 stack. To enable fine-grained parallelism, we propose a massively parallel array of specialized bit-serial bank PE and channel PE designed to accelerate the core min-plus operations. Our novel dataflow complements this hardware, employing an interleaved mapping policy for superior load balancing and hybrid in and near memory computing model for efficient computation and reduction. The novel in-bank computing approach allows all distance updates to be performed and stored in memory bank, a key contribution is that eliminates the data movement bottleneck inherent in GPU-based approaches. We implement a full software and hardware co-design using a cycle-accurate simulator to simulate an 8-channel, 4-Hi HBM3 PIM stack on real road-network traces. Experimental results show that, for a 8192 x 8192 graph, PIM-FW achieves a 18.7x speedup in end-to-end execution, and consumes 3200x less DRAM energy compared to a state-of-the-art GPU-only Floyd-Warshall.
翻译:全对最短路径(APSP)是用于路由规划、物流优化和网络分析的基础算法,但经典Floyd-Warshall(FW)算法的立方时间复杂度和大量数据移动严重限制了其在传统CPU或GPU上的可扩展性。本文提出PIM-FW——一种新颖的协同设计硬件架构与数据流,利用内存内及近内存处理架构在HBM3堆栈上加速分块FW算法。为实现细粒度并行,我们设计了由专用位串行存储体处理单元和通道处理单元组成的大规模并行阵列,以加速核心的min-plus运算。创新的数据流与硬件互补,采用交错映射策略实现卓越的负载均衡,并结合内存内与近内存混合计算模型以高效执行计算与规约。新型存储体内计算方法使所有距离更新操作均可在存储体内完成并存储,其关键贡献在于消除了基于GPU方案固有的数据移动瓶颈。我们通过周期精确模拟器实现了完整的软硬件协同设计,在真实路网轨迹上模拟了8通道、4层堆叠的HBM3 PIM系统。实验结果表明,对于8192×8192规模的图数据,PIM-FW相比最先进的纯GPU Floyd-Warshall实现,端到端执行速度提升18.7倍,DRAM能耗降低3200倍。