Multi-die FPGAs are widely adopted to deploy large hardware accelerators. Two factors impede the performance optimization of HLS designs implemented on multi-die FPGAs. On the one hand, nets that cross die boundaries incur long delays, and properly floorplanning and pipelining an application to account for them is an NP-hard problem. On the other hand, traditional automated search flows for HLS directive optimization target single-die FPGAs, and hence cannot consider the resource constraints on each die or the timing issues incurred by die crossings. Further, because of the large design scale, legalizing the floorplan of the HLS design generated under each group of configurations during directive optimization leads to an excessively long runtime. To co-optimize the directives and floorplan of HLS designs on multi-die FPGAs, we propose the FADO framework, which formulates the directive-floorplan co-search problem based on multi-choice multi-dimensional bin-packing and solves it using an iterative optimization flow. In each step of the directive search, a latency-bottleneck-guided greedy algorithm searches for more efficient directive configurations. For floorplanning, instead of repeatedly invoking global floorplanning algorithms, we implement a more efficient incremental floorplan legalization algorithm. It mainly applies the worst-fit online bin-packing algorithm to balance the floorplan, together with an offline best-fit-decreasing re-packing to compact the floorplan, followed by pipelining of the long wires crossing die boundaries. In experiments on HLS designs mixing dataflow and non-dataflow kernels, FADO not only automates the co-optimization and finishes with a 693X~4925X shorter runtime than DSE assisted by global floorplanning, but also yields an improvement of 1.16X~8.78X in overall workflow execution time after implementation on the Xilinx Alveo U250 FPGA.
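The two packing heuristics named in the abstract can be illustrated with a minimal sketch. It is not FADO's implementation: the abstract does not give the cost model or data structures, so every name here (`fits`, `slack`, `place`, `worst_fit_online`, `best_fit_decreasing`) is hypothetical. The resource types, the identical capacity for every die, and the "smallest remaining fraction" scoring rule are likewise assumptions chosen for illustration. Each kernel is a vector of resource demands (for example LUTs and DSPs) and each die is a bin.

```python
from typing import Callable, List, Optional, Sequence

Vec = Sequence[int]  # one entry per resource type (assumed: e.g. LUT, DSP)


def fits(used: Vec, item: Vec, cap: Vec) -> bool:
    """True if adding item to a die with usage `used` stays within capacity."""
    return all(u + x <= c for u, x, c in zip(used, item, cap))


def slack(used: Vec, item: Vec, cap: Vec) -> float:
    """Smallest remaining capacity fraction over all resources after placing item."""
    return min((c - u - x) / c for u, x, c in zip(used, item, cap))


def place(items: List[Vec], order: Sequence[int], cap: Vec, n_dies: int,
          pick: Callable) -> Optional[List[int]]:
    """Assign items to dies in the given order; `pick` is max (worst-fit) or min (best-fit)."""
    used = [[0] * len(cap) for _ in range(n_dies)]
    assign = [-1] * len(items)
    for i in order:
        feasible = [d for d in range(n_dies) if fits(used[d], items[i], cap)]
        if not feasible:
            return None  # no legal die for this kernel
        d = pick(feasible, key=lambda die: slack(used[die], items[i], cap))
        used[d] = [u + x for u, x in zip(used[d], items[i])]
        assign[i] = d
    return assign


def worst_fit_online(items: List[Vec], cap: Vec, n_dies: int) -> Optional[List[int]]:
    """Online worst-fit: items arrive in order and go to the die left with the most slack."""
    return place(items, range(len(items)), cap, n_dies, max)


def best_fit_decreasing(items: List[Vec], cap: Vec, n_dies: int) -> Optional[List[int]]:
    """Offline best-fit-decreasing: largest items first, each to the tightest feasible die."""
    order = sorted(range(len(items)),
                   key=lambda i: max(x / c for x, c in zip(items[i], cap)),
                   reverse=True)
    return place(items, order, cap, n_dies, min)


kernels = [(40, 20), (30, 30), (20, 10), (10, 10)]  # made-up resource demands
capacity = (100, 100)                               # made-up per-die capacity
print(worst_fit_online(kernels, capacity, 2))       # [0, 1, 1, 0]
print(best_fit_decreasing(kernels, capacity, 2))    # [0, 0, 0, 0]
```

On this toy input, worst-fit spreads the kernels across both dies, which corresponds to the balancing role described above, while best-fit-decreasing packs all four onto one die, which corresponds to the compaction role. Either function returns `None` when some kernel fits on no die.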