In an effort to lower the barrier to the adoption of FPGAs by a broader community, major FPGA vendors today offer compiler toolchains for OpenCL code. While these toolchains allow porting existing code to FPGAs, ensuring performance portability across devices (i.e., CPUs, GPUs, and FPGAs) is not a trivial task. This is in part due to the different hardware characteristics of these devices, including the nature of the hardware parallelism and the memory bandwidth they offer. In particular, global memory accesses are known to be one of the main performance bottlenecks for OpenCL kernels deployed on FPGAs. In this paper, we investigate the use of pipes to improve the memory bandwidth utilization and performance of OpenCL kernels running on FPGAs. This is done by separating the global memory accesses from the computation, enabling better use of the load units required to access global memory. We perform experiments on a set of broadly used benchmark applications with various compute and memory access patterns. Our experiments, conducted on an Intel Arria GX board, show that the proposed method is effective in improving the memory bandwidth utilization of most kernels, particularly those exhibiting irregular memory access patterns. This, in turn, leads to performance improvements, which in some cases are significant.
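To illustrate the general idea of decoupling global memory accesses from computation, the sketch below shows a reader/compute kernel pair connected by an OpenCL pipe. This is a minimal sketch only, not the paper's actual transformation: kernel names, parameters, and the placeholder computation are illustrative, and the syntax assumes the Intel FPGA SDK for OpenCL, which supports OpenCL 2.0 pipes with blocking and depth attributes.

```c
// Illustrative sketch: the "reader" kernel performs all global-memory loads
// and streams values into a pipe; the "compute" kernel consumes the pipe,
// so its datapath is decoupled from the global-memory load units.
// (Names and the squaring operation are placeholders, not from the paper.)

__kernel void reader(__global const float * restrict src,
                     const uint n,
                     write_only pipe float __attribute__((blocking))
                                           __attribute__((depth(64))) p)
{
    for (uint i = 0; i < n; i++) {
        float v = src[i];     // the only global-memory load in this pair
        write_pipe(p, &v);    // stream the value to the compute kernel
    }
}

__kernel void compute(__global float * restrict dst,
                      const uint n,
                      read_only pipe float __attribute__((blocking))
                                           __attribute__((depth(64))) p)
{
    for (uint i = 0; i < n; i++) {
        float v;
        read_pipe(p, &v);     // data arrives through the pipe, not a load unit
        dst[i] = v * v;       // placeholder computation
    }
}
```

With blocking pipes, each single work-item kernel can be pipelined independently by the compiler, and the pipe depth provides buffering that helps absorb latency variations of irregular global memory accesses.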