We present a novel, hardware-agnostic implementation strategy for lattice Boltzmann (LB) simulations that delivers high performance on both homogeneous and heterogeneous many-core platforms. Based solely on C++17 Parallel Algorithms, our approach does not rely on any language extensions, external libraries, vendor-specific code annotations, or pre-compilation steps. Thanks in particular to a recently proposed GPU back-end to C++17 Parallel Algorithms, we show that a single code base compiles and reaches state-of-the-art performance on both many-core CPU and GPU environments for the solution of a given nontrivial fluid dynamics problem. The proposed strategy is evaluated with six commonly used implementation schemes to assess the performance impact of memory access patterns on different platforms. Nine different LB collision models are included in the tests and exhibit good performance, demonstrating the versatility of our parallel approach. This work shows that drawing a distinction between research and production software is less necessary than ever, as a concise and generic LB implementation yields performance comparable to that achievable in a hardware-specific programming language. The results also highlight the performance gains achieved by modern many-core CPUs and their apparent capability to narrow the gap with the traditionally much faster GPU platforms. All code is made available to the community in the form of the open-source project stlbm, which serves both as a stand-alone simulation software and as a collection of reusable patterns for the acceleration of pre-existing LB codes.
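To make the idea concrete, the following is a minimal sketch of how a purely local LB collision step might be expressed with a C++17 parallel algorithm. It is not taken from stlbm: the `Cell` type, the function names `collideBGK` and `collideAll`, the one-dimensional D1Q3 lattice, and the array-of-structures layout are all assumptions chosen for brevity, and the streaming step is omitted.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <execution>
#include <utility>
#include <vector>

// Hypothetical D1Q3 cell: populations for the velocities c = {0, +1, -1}.
using Cell = std::array<double, 3>;

// BGK relaxation of a single cell towards its local equilibrium.
// Assumes a non-zero density (the populations do not sum to zero).
inline void collideBGK(Cell& f, double omega) {
    constexpr double w[3] = {2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0};  // lattice weights
    constexpr double c[3] = {0.0, 1.0, -1.0};                   // lattice velocities
    const double rho = f[0] + f[1] + f[2];                      // density
    const double u = (f[1] - f[2]) / rho;                       // velocity
    for (int i = 0; i < 3; ++i) {
        const double cu = c[i] * u;
        // Second-order equilibrium with speed of sound squared cs^2 = 1/3.
        const double feq =
            w[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * u * u);
        f[i] += omega * (feq - f[i]);
    }
}

// Applies the collision to every cell. The execution policy is a parameter,
// so the same source can run sequentially or in parallel.
template <class Policy>
void collideAll(Policy&& policy, std::vector<Cell>& lattice, double omega) {
    assert(omega > 0.0 && omega < 2.0);  // usual stability range for BGK
    std::for_each(std::forward<Policy>(policy), lattice.begin(), lattice.end(),
                  [omega](Cell& f) { collideBGK(f, omega); });
}
```

A call such as `collideAll(std::execution::par_unseq, lattice, 1.0)` requests parallel, vectorized execution; since each cell is updated independently, no synchronization is needed. Whether that call runs on CPU threads or is offloaded to a GPU depends on the compiler and standard library in use (for instance, NVIDIA's `nvc++` with `-stdpar`), and some CPU toolchains need an additional threading library at link time.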