Transformations of High-Level Synthesis Codes for High-Performance Computing

Spatial computing architectures promise a major stride in performance and energy efficiency over the traditional load/store devices currently employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target spatial computing architectures, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes, due to fundamentally distinct aspects of hardware design, such as programming for deep pipelines, distributed memory resources, and scalable routing. To alleviate this, we present a collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. We systematically identify classes of transformations (pipelining, scalability, and memory), the characteristics of their effect on the HLS code and the resulting hardware (e.g., increasing data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolving interface contention or increasing parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip dataflow, allowing for massively parallel architectures. To quantify the effect of various transformations, we cover the optimization process of a sample set of HPC kernels, provided as open source reference codes. We aim to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential offered by spatial computing architectures using HLS.
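To make the idea of a pipelining transformation concrete, the following is a minimal C++ sketch of one commonly used pattern: interleaving an accumulation across several partial sums so that consecutive loop iterations no longer depend on each other. It is an illustration written for this summary, not code taken from the reference kernels mentioned in the abstract. The function names, the constant `kLanes`, and the assumed adder latency are all hypothetical, and whether a given HLS tool actually pipelines the loop at one iteration per cycle depends on the tool and its directives.

```cpp
#include <cstddef>

// Assumption: the floating-point adder takes at most kLanes cycles,
// so kLanes independent partial sums are enough to hide its latency.
constexpr std::size_t kLanes = 8;

// Baseline: every addition needs the result of the previous one.
// In a hardware pipeline this loop-carried dependency forces each
// iteration to wait for the adder to finish.
float sum_naive(const float* data, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}

// Transformed: iteration i updates partial[i % kLanes], so the same
// partial sum is touched only once every kLanes iterations. A short
// second loop then combines the partial sums.
float sum_interleaved(const float* data, std::size_t n) {
    float partial[kLanes] = {};  // zero-initialized partial sums
    for (std::size_t i = 0; i < n; ++i) {
        partial[i % kLanes] += data[i];
    }
    float acc = 0.0f;
    for (std::size_t k = 0; k < kLanes; ++k) {
        acc += partial[k];
    }
    return acc;
}
```

Both functions compute the same mathematical sum, but the order of the floating-point additions differs, so results can differ in the last bits for general inputs. The sketch also shows the kind of trade-off the abstract describes: the transformed version spends extra on-chip storage (`kLanes` registers) and a small reduction loop to remove a dependency from the main pipeline.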
