While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly porting of the application to yet another programming model. We propose an alternative approach, based on Polygeist/MLIR, that automatically translates programs written in one programming model (CUDA) into another (CPU threads). Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification, and that enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU, achieving a 76% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only supercomputer Fugaku without user intervention. Our PyTorch compatibility layer, which makes use of transpiled CUDA PyTorch kernels, outperforms PyTorch's native CPU backend by 2.7$\times$.
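To convey the core idea of the GPU-to-CPU mapping described above, the following sketch shows a trivial CUDA kernel (as a comment) and a hand-written CPU-threaded equivalent in which the grid of blocks becomes an outer parallel loop and the threads of a block become an inner sequential loop. This is an illustrative assumption of the general technique, not Polygeist's actual generated code; the `axpy` example and all identifiers are hypothetical.

```cpp
// Illustrative sketch only: a hand-written CPU mapping of a trivial CUDA
// kernel. NOT the output of the Polygeist/MLIR pipeline described in the
// paper; it merely illustrates the block/thread-to-loop correspondence.
//
// Original CUDA kernel (hypothetical example):
//   __global__ void axpy(float a, const float *x, float *y, int n) {
//     int i = blockIdx.x * blockDim.x + threadIdx.x;
//     if (i < n) y[i] = a * x[i] + y[i];
//   }
//   // launched as: axpy<<<gridDim, blockDim>>>(a, x, y, n);

#include <cstdio>
#include <vector>

// CPU version: the grid becomes an outer OpenMP-parallel loop over blocks;
// the threads of each block become an inner sequential loop (legal here
// because the kernel uses no barriers or shared memory).
void axpy_cpu(float a, const float *x, float *y, int n,
              int gridDim, int blockDim) {
#pragma omp parallel for
  for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx) {
    for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
      int i = blockIdx * blockDim + threadIdx;
      if (i < n)
        y[i] = a * x[i] + y[i];
    }
  }
}

int main() {
  const int n = 1 << 10, blockDim = 256;
  const int gridDim = (n + blockDim - 1) / blockDim;
  std::vector<float> x(n, 1.0f), y(n, 2.0f);
  axpy_cpu(3.0f, x.data(), y.data(), n, gridDim, blockDim);
  std::printf("y[0] = %f\n", y[0]);  // expect 5.0
  return 0;
}
```

Compiled with an OpenMP-enabled compiler (e.g. `g++ -fopenmp`), this runs the former GPU blocks across CPU threads, which is the kind of mapping the framework performs automatically on the compiler IR.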