Stencil computation is one of the most important kernels in scientific and engineering applications. Much prior work has focused on vectorization and tiling techniques, which exploit in-core data parallelism and data locality, respectively. In this paper, we analyze the downsides of existing vectorization schemes: they either incur data alignment conflicts or hurt data locality when integrated with tiling. We then propose a novel transpose layout that preserves data locality for tiling while reducing the data reorganization overhead of vectorization. To further improve data reuse at the register level, we design a time loop unroll-and-jam strategy that performs multistep stencil computation along the time dimension. Experimental results on AVX2 and AVX-512 CPUs show that our approach achieves competitive performance.
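To make the multistep idea concrete, the following is a minimal scalar sketch in pure Python, assuming a 1D 3-point stencil with fixed boundary cells. It is not the paper's implementation, which the abstract does not detail: the names `step` and `two_steps_fused` and the weights `W` are hypothetical, and the sketch shows neither vectorization nor the transpose layout. It only illustrates how advancing two time steps in one pass lets first-step results stay in local variables (standing in for registers) instead of being written to and reread from an intermediate array.

```python
W = (0.25, 0.5, 0.25)  # hypothetical stencil weights, chosen for illustration


def step(a, w=W):
    """One sweep of a 1D 3-point stencil; the two boundary cells stay fixed."""
    out = list(a)
    for i in range(1, len(a) - 1):
        out[i] = w[0] * a[i - 1] + w[1] * a[i] + w[2] * a[i + 1]
    return out


def two_steps_fused(a, w=W):
    """Two sweeps in a single pass over the grid."""
    n = len(a)
    out = list(a)
    if n < 3:
        return out  # no interior cells to update

    def first(j):
        # Value of cell j after the first sweep (boundaries are unchanged).
        if j == 0 or j == n - 1:
            return a[j]
        return w[0] * a[j - 1] + w[1] * a[j] + w[2] * a[j + 1]

    left, centre = first(0), first(1)
    for i in range(1, n - 1):
        right = first(i + 1)
        # Second sweep, fed by intermediates held in local variables.
        out[i] = w[0] * left + w[1] * centre + w[2] * right
        left, centre = centre, right  # slide the window; no temporary array
    return out
```

Under these assumptions, `two_steps_fused(grid)` should return the same list as `step(step(grid))`, since both evaluate the same expressions in the same order; the fused version simply never materializes the intermediate time step.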