Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. Matrix multiplication is no exception: lower bounds have been proven, and algorithms attaining them implemented, for both shared and distributed memory systems. Reconfigurable hardware platforms are a lucrative target for I/O minimizing algorithms, as they offer the programmer full control of memory accesses. While bounds developed in the context of fixed architectures still apply to these platforms, the spatially distributed nature of their computational and memory resources requires a decentralized approach to optimize algorithms for maximum hardware utilization. We present a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and minimum off-chip data movement, within constraints set by the hardware. We map the model to a concrete architecture using a high-level synthesis tool. Maintaining a high level of abstraction allows us to support arbitrary data types and enables maintainability and portability across FPGA devices. Kernels generated from our architecture are shown to offer competitive performance in practice, scaling with both compute and memory resources. We offer our design as an open-source project to encourage the open development of linear algebra and I/O minimizing algorithms on reconfigurable hardware platforms.