Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs for accelerating parallel workloads. While the former excel at regular dataflow at the cost of rigid architectures and complex programming models, the latter are versatile and easy to program but require explicit dataflow management and synchronization. This work aims to enable efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture in which small, energy-efficient RISC-V cores act as the systolic array's processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster's shared memory. We introduce two low-overhead RISC-V ISA extensions for efficient systolic execution, namely Xqueue and queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in MemPool, an open-source manycore cluster with 256 PEs, and analyze the trade-offs of the hybrid systolic-shared-memory architecture on matrix multiplication, convolution, and FFT kernels. For an area increase of just 6%, our hybrid architecture almost doubles MemPool's compute unit utilization, reaching up to 95%, and significantly improves energy efficiency, with up to 63% of the power spent in the PEs. In typical conditions (TT/0.80 V/25 °C) in a 22 nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 64% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W.
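To make the hybrid execution model more concrete, the following is a minimal software sketch, not the hardware mechanism itself. It models a linear chain of PEs connected by bounded FIFO queues and uses it to compute a matrix-vector product: partial sums flow systolically through the queues, while each PE reads its matrix operand directly from ordinary (shared) memory. All names (`Queue`, `QUEUE_DEPTH`, `systolic_matvec`) and the queue depth are assumptions made for illustration; the sketch does not reproduce the semantics of the Xqueue instructions or of QLRs.

```python
from collections import deque

QUEUE_DEPTH = 4  # hypothetical queue capacity, chosen only for illustration


class Queue:
    """Bounded FIFO standing in for a queue mapped in shared memory."""

    def __init__(self, depth=QUEUE_DEPTH):
        self.items = deque()
        self.depth = depth

    def is_empty(self):
        return len(self.items) == 0

    def is_full(self):
        return len(self.items) >= self.depth

    def push(self, value):
        assert not self.is_full(), "push on a full queue"
        self.items.append(value)

    def pop(self):
        assert not self.is_empty(), "pop on an empty queue"
        return self.items.popleft()


def systolic_matvec(A, x):
    """Compute y = A @ x on a simulated linear chain of len(x) PEs.

    PE i keeps x[i] locally, pops a partial sum from its input queue,
    adds A[row][i] * x[i] (A is read from 'shared memory'), and pushes
    the result to its output queue.
    """
    n_rows, n_pes = len(A), len(x)
    links = [Queue() for _ in range(n_pes + 1)]  # links[i] feeds PE i
    next_row = [0] * n_pes  # row each PE will process next
    fed = 0                 # number of zero partial sums injected so far
    y = []

    while len(y) < n_rows:
        # Inject one zero partial sum per row at the head of the chain.
        if fed < n_rows and not links[0].is_full():
            links[0].push(0)
            fed += 1

        # Visit PEs from last to first so a value advances one hop per step.
        for pe in reversed(range(n_pes)):
            inq, outq = links[pe], links[pe + 1]
            if inq.is_empty() or outq.is_full():
                continue  # the PE stalls this step
            partial = inq.pop()
            row = next_row[pe]
            outq.push(partial + A[row][pe] * x[pe])
            next_row[pe] += 1

        # Drain one finished result from the tail of the chain.
        if not links[-1].is_empty():
            y.append(links[-1].pop())

    return y


if __name__ == "__main__":
    print(systolic_matvec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```

In this model every queue access is an explicit `push` or `pop` call. The abstract describes Xqueue as reducing such an access to a single instruction and QLRs as making it implicit; the sketch only illustrates the dataflow that these extensions are meant to accelerate.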