We introduce a code generator that converts unoptimized C++ code operating on sparse data into vectorized and parallel CPU or GPU kernels. Our approach unrolls the computation into a massive expression graph, performs redundant expression elimination and grouping, and then generates an architecture-specific kernel that solves the same problem. It assumes that the sparsity pattern is fixed, which is a common scenario in many computer graphics and scientific computing applications. We show that our approach scales to large problems and achieves speedups of up to two orders of magnitude on CPUs and three orders of magnitude on GPUs over a set of manually optimized CPU baselines. To demonstrate its practical applicability, we use it to optimize popular algorithms with applications in physical simulation and interactive mesh deformation.
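To make the unroll-then-deduplicate idea concrete, here is a minimal sketch in C++. It is not the generator described above: `ExprGraph`, `Node`, and `unroll_edge_products` are hypothetical names chosen for illustration, and the redundancy removal is plain hash-consing (structurally identical nodes are stored once), which is only one possible way to eliminate redundant expressions. The sketch assumes a fixed edge list standing in for a fixed sparsity pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical node kinds for a tiny expression graph (illustrative only).
enum class Op { Input, Add, Mul };

struct Node {
    Op op;
    int lhs;  // operand node index, or the input slot for Op::Input
    int rhs;  // operand node index, or -1 for Op::Input
};

class ExprGraph {
public:
    int input(int slot) { return intern(Op::Input, slot, -1); }
    int add(int a, int b) { return binary(Op::Add, a, b); }
    int mul(int a, int b) { return binary(Op::Mul, a, b); }

    std::size_t size() const { return nodes_.size(); }

    // Evaluate nodes in creation order; operands always precede their users.
    double eval(int root, const std::vector<double>& inputs) const {
        assert(root >= 0 && static_cast<std::size_t>(root) < nodes_.size());
        std::vector<double> value(nodes_.size(), 0.0);
        for (std::size_t i = 0; i <= static_cast<std::size_t>(root); ++i) {
            const Node& n = nodes_[i];
            switch (n.op) {
                case Op::Input: value[i] = inputs.at(n.lhs); break;
                case Op::Add:   value[i] = value[n.lhs] + value[n.rhs]; break;
                case Op::Mul:   value[i] = value[n.lhs] * value[n.rhs]; break;
            }
        }
        return value[root];
    }

private:
    // Add and Mul are commutative, so sort operands into a canonical order
    // before lookup; this lets a*b and b*a map to the same node.
    int binary(Op op, int a, int b) {
        if (a > b) std::swap(a, b);
        return intern(op, a, b);
    }

    // Hash-consing: return the existing node for this key, or create it.
    int intern(Op op, int lhs, int rhs) {
        const auto key = std::make_tuple(op, lhs, rhs);
        const auto it = index_.find(key);
        if (it != index_.end()) return it->second;
        nodes_.push_back(Node{op, lhs, rhs});
        const int id = static_cast<int>(nodes_.size()) - 1;
        index_.emplace(key, id);
        return id;
    }

    std::vector<Node> nodes_;
    std::map<std::tuple<Op, int, int>, int> index_;
};

// Unroll sum over a fixed edge list of x[i] * x[j] into the graph.
// The edge list plays the role of the fixed sparsity pattern (an assumption
// of this sketch); repeated or mirrored edges reuse the same Mul node.
int unroll_edge_products(ExprGraph& g,
                         const std::vector<std::pair<int, int>>& edges) {
    assert(!edges.empty());
    int acc = -1;
    for (const auto& e : edges) {
        const int term = g.mul(g.input(e.first), g.input(e.second));
        acc = (acc < 0) ? term : g.add(acc, term);
    }
    return acc;
}
```

Calling `unroll_edge_products` on the edges `(0,1)`, `(1,0)`, `(1,2)` stores the product `x[0] * x[1]` only once, even though the loop requests it twice. A real generator would then emit straight-line, vectorizable code from the deduplicated node list instead of interpreting it as `eval` does here.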