Spatial computing devices have been shown to significantly accelerate stencil computations, but existing approaches have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate our generated architectures on a Stratix 10 FPGA testbed, achieving 1.31 TOp/s on a single device and 4.18 TOp/s across multiple devices, the highest performance recorded for stencil programs on FPGAs to date. We then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into the architectural characteristics required for their efficient execution in practice.