With the ever-growing heterogeneity of computing systems, driven by modern machine learning applications, pressure is increasing on memory systems to handle arbitrary and more demanding transfers efficiently. Descriptor-based direct memory access controllers (DMACs) enable such transfers by decoupling memory transfers from the processing units. Classical descriptor-based DMACs, however, are inefficient when handling arbitrary transfers of small unit sizes: their excessive descriptor size and serialized descriptor processing lead to large static overheads when setting up transfers. To tackle this inefficiency, we propose a descriptor-based DMAC optimized for arbitrary transfers of small unit sizes. We implement a lightweight descriptor format in an AXI4-based DMAC and further increase performance with a low-overhead speculative descriptor prefetching scheme that incurs no additional latency penalty on a misprediction. To evaluate its performance, we integrate our DMAC into a 64-bit Linux-capable RISC-V SoC and emulate it on a Kintex FPGA. Compared to an off-the-shelf descriptor-based DMAC IP, we achieve 1.66x lower latency when launching transfers and increase bus utilization by up to 2.5x in an ideal memory system with 64-byte transfers, while requiring 11% fewer lookup tables, 23% fewer flip-flops, and no block RAMs. In deep memory systems, our lead in bus utilization with 64-byte transfers extends to 3.6x. Synthesized in GlobalFoundries' GF12LP+ node, our DMAC achieves a clock frequency of over 1.44 GHz while occupying only 49.5 kGE.