Leveraging the SIMD capability of modern CPU architectures is essential to take full advantage of their increasing performance. To exploit this capability, binary executables must be explicitly vectorized, either by developers or by an automatic vectorization tool. This is why the compilation research community has created several strategies to transform scalar code into a vectorized implementation. However, the majority of these approaches focus on regular algorithms, such as affine loops, that can be vectorized with few data transformations. In this paper, we present a new approach that automatically vectorizes scalar codes with chaotic data accesses, as long as their operations can be statically inferred. We describe how our method transforms a graph of scalar instructions into a vectorized one, using different heuristics with the aim of reducing the number or cost of the instructions. Finally, we demonstrate the benefit of our approach on various computational kernels using Intel AVX-512 and ARM SVE.
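As a rough illustration of the idea of turning a graph of scalar instructions into a vectorized one, the following Python sketch packs independent scalar instructions that share an opcode into vector-wide groups. It is not the method or the heuristics of this paper: the grouping rule (same dependency depth and same opcode), the vector width `VEC_WIDTH`, the toy kernel, and all function names are assumptions made for illustration only. Lane-wise Python lists stand in for the gather and permute instructions a real AVX-512 or SVE backend would emit.

```python
import operator
from collections import defaultdict

VEC_WIDTH = 4  # hypothetical number of lanes per vector register
OPS = {"add": operator.add, "mul": operator.mul}


def build_scalar_graph():
    """Toy kernel with irregular but statically known accesses (assumed example).

    Each entry maps a name to (opcode, operand_a, operand_b). An operand is
    either ("in", i), a load of input element i, or the name of an earlier
    instruction.
    """
    ia = [7, 2, 5, 0]
    ib = [1, 6, 3, 4]
    graph = {}
    for k in range(4):
        graph[f"m{k}"] = ("mul", ("in", ia[k]), ("in", ib[k]))
    graph["s0"] = ("add", "m0", "m1")
    graph["s1"] = ("add", "m2", "m3")
    return graph


def depth(graph, name, memo):
    """Length of the longest dependency chain below an instruction."""
    if name not in memo:
        _, a, b = graph[name]
        deps = [o for o in (a, b) if isinstance(o, str)]
        memo[name] = 1 + max((depth(graph, d, memo) for d in deps), default=-1)
    return memo[name]


def pack(graph, width=VEC_WIDTH):
    """Group instructions by (depth, opcode), then split into vector-wide packs.

    Instructions at the same depth cannot depend on each other, so each pack
    can execute as one vector operation.
    """
    memo = {}
    groups = defaultdict(list)
    for name in graph:
        groups[(depth(graph, name, memo), graph[name][0])].append(name)
    packs = []
    for key in sorted(groups):  # shallower packs first, so operands are ready
        names = groups[key]
        for i in range(0, len(names), width):
            packs.append((key[1], names[i:i + width]))
    return packs


def run_scalar(graph, x):
    """Reference execution, one instruction at a time (insertion order)."""
    values = {}
    for name, (op, a, b) in graph.items():
        va = x[a[1]] if isinstance(a, tuple) else values[a]
        vb = x[b[1]] if isinstance(b, tuple) else values[b]
        values[name] = OPS[op](va, vb)
    return values


def run_packed(graph, packs, x):
    """Execute pack by pack; operand lists model gathers or permutes."""
    values = {}

    def fetch(o):
        return x[o[1]] if isinstance(o, tuple) else values[o]

    for op, names in packs:
        lhs = [fetch(graph[n][1]) for n in names]  # one gather/permute
        rhs = [fetch(graph[n][2]) for n in names]  # one gather/permute
        for n, r in zip(names, map(OPS[op], lhs, rhs)):  # one vector op
            values[n] = r
    return values


if __name__ == "__main__":
    g = build_scalar_graph()
    for opcode, names in pack(g):
        print(opcode, names)
```

In this toy case, six scalar arithmetic instructions collapse into two packs, at the price of the data movement needed to assemble each operand vector. A realistic vectorizer would have to weigh that movement cost against the arithmetic savings, which is the kind of trade-off the heuristics mentioned in the abstract aim to address.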