Over the past few years, there has been increased interest in including FPGAs in data centers and high-performance computing clusters alongside GPUs and other accelerators. As a result, it has become increasingly important to have a unified, high-level programming interface for CPUs, GPUs and FPGAs. This has led to the development of compiler toolchains that deploy OpenCL code on FPGAs. However, the fundamental architectural differences between GPUs and FPGAs have led to performance portability issues: it has been shown that OpenCL code optimized for GPUs does not necessarily map well to FPGAs, often requiring manual optimizations to improve performance. In this paper, we explore the use of thread coarsening, a compiler technique that consolidates the work of multiple threads into a single thread, on OpenCL code running on FPGAs. While this optimization has been explored on CPUs and GPUs, the architectural features of FPGAs and the nature of the parallelism they offer lead to different performance considerations, making an analysis of thread coarsening on FPGAs worthwhile. Our evaluation, performed on our own microbenchmarks and on a set of applications from open-source benchmark suites, shows that thread coarsening can yield performance benefits (speedups of up to 3-4x) for OpenCL code running on FPGAs at a limited resource utilization cost.
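To make the idea of thread coarsening concrete, the following is a minimal sketch in Python that emulates a one-dimensional OpenCL-style launch sequentially. It is an illustration only, not the transformation used in the paper: the names `run_ndrange`, `vec_add`, `vec_add_coarsened`, `coarsened_global_size` and the factor of 4 are all hypothetical, and assigning consecutive elements to each coarsened work-item is just one possible mapping.

```python
COARSENING_FACTOR = 4  # hypothetical factor, chosen only for illustration


def run_ndrange(kernel, global_size, *args):
    """Sequentially emulate a 1D launch: call the kernel once per work-item id."""
    for gid in range(global_size):
        kernel(gid, *args)


def vec_add(gid, a, b, out):
    # Baseline kernel: each work-item produces exactly one output element.
    out[gid] = a[gid] + b[gid]


def vec_add_coarsened(gid, a, b, out):
    # Coarsened kernel: one work-item does the work of COARSENING_FACTOR
    # baseline work-items, here on consecutive elements (an assumed mapping).
    base = gid * COARSENING_FACTOR
    for k in range(COARSENING_FACTOR):
        i = base + k
        if i < len(out):  # guard for sizes not divisible by the factor
            out[i] = a[i] + b[i]


def coarsened_global_size(n, factor=COARSENING_FACTOR):
    # Fewer work-items are launched: ceil(n / factor).
    return (n + factor - 1) // factor
```

Both kernels compute the same result; the coarsened version simply launches fewer work-items, each with more work. For example, `run_ndrange(vec_add_coarsened, coarsened_global_size(len(out)), a, b, out)` fills `out` with the same values as `run_ndrange(vec_add, len(out), a, b, out)`.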