We present a novel, hardware-agnostic implementation strategy for lattice Boltzmann (LB) simulations that achieves high performance on homogeneous and heterogeneous many-core platforms. Based solely on C++17 Parallel Algorithms, our approach does not rely on any language extensions, external libraries, vendor-specific code annotations, or pre-compilation steps. Thanks in particular to a recently proposed GPU back-end to C++17 Parallel Algorithms, we show that a single code can compile and reach state-of-the-art performance in both many-core CPU and GPU environments for the solution of a given nontrivial fluid dynamics problem. The proposed strategy is evaluated with six different, commonly used implementation schemes to assess the performance impact of memory access patterns on different platforms. Nine different LB collision models are included in the tests and exhibit good performance, demonstrating the versatility of our parallel approach. This work shows that the distinction between research and production software is less necessary than ever, as a concise and generic LB implementation yields performance comparable to that achievable in a hardware-specific programming language. The results also highlight the performance gains achieved by modern many-core CPUs and their apparent ability to narrow the gap with GPU platforms, which have traditionally been far faster. All code is made available to the community in the form of the open-source project "stlbm", which serves both as a stand-alone simulation software and as a collection of reusable patterns for the acceleration of pre-existing LB codes.
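To illustrate the kind of code this strategy implies, the following is a minimal sketch of a BGK collision step expressed with a C++17 parallel algorithm. It is not taken from stlbm: the one-dimensional D1Q3 lattice, the array-of-structures layout, and the names `Cell`, `collide`, `collide_all`, and `total_mass` are assumptions chosen for brevity. Only standard library facilities are used, so the same source can in principle be handed to any compiler that provides an execution-policy back-end.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical cell type: three populations of a D1Q3 lattice (rest, +x, -x).
using Cell = std::array<double, 3>;

// Standard D1Q3 weights and discrete velocities (speed of sound squared = 1/3).
constexpr std::array<double, 3> w = {2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0};
constexpr std::array<int, 3> c = {0, 1, -1};

// BGK collision on a single cell: relax each population towards the
// second-order equilibrium computed from the local density and velocity.
inline void collide(Cell& f, double omega) {
    const double rho = f[0] + f[1] + f[2];
    const double u = (f[1] - f[2]) / rho;
    for (std::size_t i = 0; i < 3; ++i) {
        const double cu = c[i] * u;
        const double feq =
            w[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * u * u);
        f[i] += omega * (feq - f[i]);
    }
}

// Apply the collision to every cell. The execution policy is the only
// parallel construct; each cell is updated independently of the others.
void collide_all(std::vector<Cell>& lattice, double omega) {
    std::for_each(std::execution::par_unseq, lattice.begin(), lattice.end(),
                  [omega](Cell& f) { collide(f, omega); });
}

// Parallel reduction over the lattice: sum of all populations (total mass).
double total_mass(const std::vector<Cell>& lattice) {
    return std::transform_reduce(
        std::execution::par_unseq, lattice.begin(), lattice.end(), 0.0,
        std::plus<>(),
        [](const Cell& f) { return f[0] + f[1] + f[2]; });
}
```

A caller would fill a `std::vector<Cell>`, invoke `collide_all(lattice, omega)` once per time step, and could use `total_mass` as a conservation check. A streaming step, boundary conditions, and the memory layouts compared in the paper are deliberately left out of this sketch.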