There is growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) that interact via custom buffer hierarchies and networks-on-chip. Their efficiency comes from optimized dataflow strategies (i.e., spatial/temporal partitioning of data across the PEs and fine-grained scheduling) that improve data reuse. The focus of this work is to evaluate these accelerator architectures using a tiled general matrix-matrix multiplication (GEMM) kernel. To do so, we develop a framework that finds optimized mappings (dataflow and tile sizes) for a tiled GEMM for a given spatial accelerator and workload combination, leveraging an analytical cost model for runtime and energy. Our evaluations over five spatial accelerators demonstrate that the tiled GEMM mappings systematically generated by our framework achieve high performance on various GEMM workloads and accelerators.
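To make the idea of a tiled GEMM mapping concrete, the following is a minimal sketch and not the framework described above. It shows a loop-tiled GEMM and a brute-force search over tile sizes driven by a toy cost. All names (`tiled_gemm`, `toy_traffic`, `best_tiles`) and the cost formula are assumptions introduced for illustration: the cost only counts words moved into a single on-chip buffer, and it does not model runtime, energy, the PE array, or the network-on-chip.

```python
from itertools import product


def tiled_gemm(A, B, M, N, K, tm, tn, tk):
    """Compute C (M x N) = A (M x K) * B (K x N) one tile at a time."""
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tm):              # tile loops
        for j0 in range(0, N, tn):
            for k0 in range(0, K, tk):
                # work inside one (tm x tn x tk) tile; min() handles ragged edges
                for i in range(i0, min(i0 + tm, M)):
                    for j in range(j0, min(j0 + tn, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tk, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


def toy_traffic(M, N, K, tm, tn):
    """Hypothetical cost: words fetched into the buffer for the loop order above.

    Assumes A is re-read once per column tile of C, B once per row tile of C,
    and each C tile stays resident while the k loop runs. In this toy model
    tk affects only the buffer footprint, not the traffic.
    """
    tiles_m = -(-M // tm)                   # ceiling division
    tiles_n = -(-N // tn)
    return M * K * tiles_n + K * N * tiles_m + M * N


def best_tiles(M, N, K, buffer_words, candidates=(1, 2, 4, 8, 16, 32, 64)):
    """Exhaustively pick (tm, tn, tk) that fits the buffer and minimizes toy_traffic."""
    best = None
    for tm, tn, tk in product(candidates, repeat=3):
        if tm > M or tn > N or tk > K:
            continue
        # one tile each of A, B, and C must fit in the buffer together
        if tm * tk + tk * tn + tm * tn > buffer_words:
            continue
        cost = toy_traffic(M, N, K, tm, tn)
        if best is None or cost < best[0]:
            best = (cost, (tm, tn, tk))
    return best
```

For example, `best_tiles(64, 64, 64, buffer_words=768)` returns the lowest-cost feasible tile shape under these assumptions. A real mapper would also enumerate loop orders and spatial partitioning across PEs, and would replace `toy_traffic` with a cost model for the target accelerator.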