Optimizing data movement is becoming one of the biggest challenges in heterogeneous computing as systems cope with the data deluge and the big data applications it drives. When creating specialized accelerators, modern high-level synthesis (HLS) tools are increasingly efficient at optimizing the computational aspects, but data transfers have not improved to the same degree. To address this, novel architectures such as High-Bandwidth Memory (HBM) provide wider data buses so that more data can be transferred in parallel. Designers must tailor their hardware/software interfaces to fully exploit the available bandwidth. HLS tools can automate this process, but only if the designer follows strict coding-style rules. If the bus width is not evenly divisible by the data width (e.g., when using custom-precision data types) or if the array lengths are not powers of two, the HLS-generated accelerator will likely not fully utilize the available bandwidth, demanding even more manual effort from the designer. We propose a methodology to automatically find and implement a data layout that, when streamed between memory and an accelerator, uses a higher percentage of the available bandwidth than a naive or HLS-optimized design. We borrow concepts from multiprocessor scheduling to achieve this efficiency.
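The divisibility problem can be illustrated with a small calculation. The sketch below is not the proposed methodology; it only compares two simple layouts under assumed parameters (a hypothetical 256-bit bus and 24-bit custom-precision elements). It assumes that a naive layout pads each element to the next power-of-two width, and that a dense layout packs whole elements back to back without letting any element straddle a bus word. The function names are illustrative.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two greater than or equal to n (n >= 1)."""
    return 1 << (n - 1).bit_length()


def padded_utilization(bus_bits: int, elem_bits: int) -> float:
    """Fraction of each bus word carrying payload when every element
    is padded to a power-of-two slot (assumed naive layout)."""
    slot_bits = next_pow2(elem_bits)
    elems_per_word = bus_bits // slot_bits
    return elems_per_word * elem_bits / bus_bits


def dense_utilization(bus_bits: int, elem_bits: int) -> float:
    """Fraction of each bus word carrying payload when whole elements
    are packed back to back and none straddles a word boundary."""
    elems_per_word = bus_bits // elem_bits
    return elems_per_word * elem_bits / bus_bits


if __name__ == "__main__":
    bus, elem = 256, 24  # hypothetical bus and element widths, in bits
    print(f"padded: {padded_utilization(bus, elem):.4f}")
    print(f"dense:  {dense_utilization(bus, elem):.4f}")
```

Under these assumptions the padded layout fits 8 elements per word (192 of 256 bits) and the dense layout fits 10 (240 of 256 bits), so even dense packing leaves 16 bits per word unused. A layout that recovers this kind of leftover capacity is what the methodology aims to find automatically.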