Shared L1 memory clusters are a common architectural pattern (e.g., in GPGPUs) for building efficient and flexible multi-processing-element (PE) engines. However, it is a common belief that these tightly-coupled clusters would not scale beyond a few tens of PEs. In this work, we tackle scaling shared L1 clusters to hundreds of PEs while supporting a flexible and productive programming model and maintaining high efficiency. We present MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring application-tunable functional units. We designed and implemented an efficient low-latency PE to L1-memory interconnect, an optimized instruction path to ensure each PE's independent execution, and a powerful DMA engine and system interconnect to stream data in and out. MemPool is easy to program, with all the cores sharing a global view of a large, multi-banked, L1 scratchpad memory, accessible within at most five cycles in the absence of conflicts. We provide multiple runtimes to program MemPool at different abstraction levels and illustrate its versatility with a wide set of applications. MemPool runs at 600 MHz (60 gate delays) in typical conditions (TT/0.80 V/25 °C) in 22 nm FDX technology and achieves a performance of up to 229 GOPS or 192 GOPS/W with less than 2% of execution stalls.
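To illustrate the shared-L1 programming model described above, the following is a minimal, hypothetical bare-metal C sketch of a data-parallel kernel on MemPool. The runtime helpers mempool_get_core_id(), mempool_get_core_count(), and mempool_barrier() are assumed names for illustration only and are not confirmed APIs; the point is that every core directly addresses the same multi-banked L1 scratchpad, so work can be split by index striding without explicit data movement.

```c
#include <stdint.h>

// Assumed runtime interface (hypothetical prototypes for illustration).
extern uint32_t mempool_get_core_id(void);
extern uint32_t mempool_get_core_count(void);
extern void mempool_barrier(uint32_t num_cores);

#define N 1024

// Arrays reside in the shared, multi-banked L1 scratchpad visible to all cores.
int32_t a[N], b[N], c[N];

void vec_add_parallel(void) {
  uint32_t core_id = mempool_get_core_id();      // this core's index (assumed helper)
  uint32_t num_cores = mempool_get_core_count(); // total PEs, e.g. 256 (assumed helper)

  // Each core processes an interleaved slice of the shared arrays;
  // the global L1 view means no per-core copies or DMA are needed here.
  for (uint32_t i = core_id; i < N; i += num_cores) {
    c[i] = a[i] + b[i];
  }

  // Synchronize before any core consumes the complete result.
  mempool_barrier(num_cores); // assumed helper
}
```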