Over the past few years, there has been growing interest in using FPGAs alongside CPUs and GPUs in high-performance computing systems and data centers. This trend has led to a push toward the use of high-level programming models and libraries, such as OpenCL, both to lower the barriers to the adoption of FPGAs by programmers unfamiliar with hardware description languages (HDLs), and to allow a single code base to be deployed seamlessly on different devices. Today, both Intel and Xilinx (now part of AMD) offer toolchains to compile OpenCL code for FPGAs. However, using OpenCL on FPGAs is complicated by performance portability issues, since different devices differ fundamentally in architecture and in the nature of the hardware parallelism they offer. Hence, platform-specific optimizations are crucial to achieving good performance across devices. In this paper, we propose using the feed-forward design model based on pipes to improve the performance of OpenCL code running on FPGAs. We show the code transformations required to apply this method to existing OpenCL kernels, and we discuss the restrictions on its applicability. Using popular benchmark suites and microbenchmarks, we show that the feed-forward design model can result in higher utilization of the available global memory bandwidth and increased instruction concurrency, thus improving the overall throughput of the OpenCL implementations at a modest resource utilization cost. Further concurrency can be achieved by using multiple producers and multiple consumers.
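As a rough illustration of the idea, and not the paper's actual kernels, the following Python sketch models a feed-forward split of one kernel into a producer that only loads from global memory and a consumer that only computes, connected by a bounded FIFO standing in for an OpenCL pipe. All names (`baseline_kernel`, `memory_kernel`, `compute_kernel`, `feed_forward`), the pipe depth, and the arithmetic are hypothetical choices made for the example; Python threads only mimic the behavior and say nothing about FPGA performance.

```python
import threading
from queue import Queue

PIPE_DEPTH = 8  # assumed FIFO depth; a hypothetical value for this sketch


def baseline_kernel(src, idx):
    """Single kernel: every iteration both loads from 'global memory' and computes."""
    return [src[i] * src[i] + 1.0 for i in idx]


def memory_kernel(src, idx, pipe):
    """Producer: performs only the loads and pushes each value into the pipe."""
    for i in idx:
        pipe.put(src[i])  # blocks while the pipe is full


def compute_kernel(n, pipe, dst):
    """Consumer: pops n values from the pipe and performs only the arithmetic."""
    for k in range(n):
        x = pipe.get()  # blocks while the pipe is empty
        dst[k] = x * x + 1.0


def feed_forward(src, idx):
    """Run producer and consumer concurrently; data flows one way through the pipe."""
    pipe = Queue(maxsize=PIPE_DEPTH)
    dst = [0.0] * len(idx)
    producer = threading.Thread(target=memory_kernel, args=(src, idx, pipe))
    consumer = threading.Thread(target=compute_kernel, args=(len(idx), pipe, dst))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return dst
```

For any input, `feed_forward(src, idx)` returns the same list as `baseline_kernel(src, idx)`; the difference is structural. Because nothing flows back from the consumer to the producer, the loads can run ahead of the computation by up to the pipe depth, which is the property the feed-forward model relies on.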