The SIMT execution model is commonly used for general-purpose GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by the compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications: the underlying ISA is SIMD, and important hardware capabilities cannot be fully utilized. To close this performance gap, we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface for exploiting the underlying hardware features, allowing fine-grained register management, SIMD size control, and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving a speedup of up to 2.7x on the latest Intel GPU.
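The contrast between the two programming styles can be sketched in plain C++. This is only an illustration under stated assumptions: `Vec`, `rotate`, `simt_scale_add`, and `simd_scale_add` are hypothetical names invented here, not part of the CM language or any Intel API, and the loops stand in for what real hardware would execute as single SIMD instructions.

```cpp
#include <array>
#include <cstddef>

// SIMT style: the programmer writes scalar code for one lane.
// The lane index would normally come from the runtime (a work-item id);
// here it is passed in explicitly for illustration.
float simt_scale_add(const float* a, const float* b, std::size_t lane) {
    return 2.0f * a[lane] + b[lane];
}

// Explicit SIMD style: a hypothetical fixed-width vector type whose width N
// is chosen by the programmer. This is NOT the CM vector type.
template <typename T, std::size_t N>
struct Vec {
    std::array<T, N> lanes;

    // Element-wise addition across all N lanes.
    Vec operator+(const Vec& other) const {
        Vec result{};
        for (std::size_t i = 0; i < N; ++i) {
            result.lanes[i] = lanes[i] + other.lanes[i];
        }
        return result;
    }

    // Multiply every lane by the same scalar.
    Vec operator*(T scalar) const {
        Vec result{};
        for (std::size_t i = 0; i < N; ++i) {
            result.lanes[i] = lanes[i] * scalar;
        }
        return result;
    }

    // Cross-lane data sharing: lane i reads the value held by lane (i + k) % N.
    // Scalar per-lane code has no direct way to express this.
    Vec rotate(std::size_t k) const {
        Vec result{};
        for (std::size_t i = 0; i < N; ++i) {
            result.lanes[i] = lanes[(i + k) % N];
        }
        return result;
    }
};

// The same computation as simt_scale_add, written once for a whole vector.
template <typename T, std::size_t N>
Vec<T, N> simd_scale_add(const Vec<T, N>& a, const Vec<T, N>& b) {
    return a * static_cast<T>(2) + b;
}
```

In the scalar version the vector width and any lane-to-lane communication are left to the compiler and hardware; in the vector version both are visible in the source, which is the kind of control the abstract attributes to an explicit SIMD model.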