Sparse tensor algebra computations have become important in many real-world applications such as machine learning, scientific simulation, and data mining. Hence, automated code generation and performance optimization of tensor algebra kernels are paramount. Recent advancements such as the Tensor Algebra Compiler (TACO) greatly generalize and automate code generation for tensor algebra expressions. However, the code generated by TACO for many important tensor computations remains suboptimal due to the absence of a scheduling directive to support transformations such as distribution/fusion. This paper extends TACO's scheduling space to support kernel distribution/loop fusion in order to reduce asymptotic time complexity and improve the locality of complex tensor algebra computations. We develop an intermediate representation (IR) for tensor operations, called the branched iteration graph, which specifies the breakdown of a computation into smaller kernels (kernel distribution) and then fuses (loop fusion) the outermost dimensions of the resulting loop nests, while the innermost dimensions remain distributed, to increase data locality. We describe the exchange of intermediate results between iteration spaces, the transformation in the IR, and its programmatic invocation. Finally, we show that the transformation can be used to optimize sparse tensor kernels. Our results show that this new transformation significantly improves the performance of several real-world tensor algebra computations compared to TACO-generated code.
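To make the transformation concrete, the sketch below illustrates the loop structure that kernel distribution plus outer-loop fusion produces for a computation of the form A(i,l) = B(i,j) * C(j,k) * D(k,l). This is a minimal sketch under stated assumptions: arrays are dense and row-major for brevity (TACO itself emits code over sparse formats, but the loop structure is analogous), and all names and sizes are illustrative, not taken from the paper. A single fused loop nest for this expression costs O(I·J·K·L); distributing it into two kernels reduces this to O(I·J·K + I·K·L), and fusing the shared outer i loop shrinks the intermediate from an I×K matrix to a K-vector, improving locality.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical dense illustration of kernel distribution + outer-loop
 * fusion for A(i,l) = B(i,j) * C(j,k) * D(k,l). Names and layout are
 * assumptions for this sketch, not TACO output. */
void fused_distributed(int I, int J, int K, int L,
                       const double *B, const double *C,
                       const double *D, double *A) {
    /* Fusing the outer i loop lets the two distributed kernels share a
     * per-row temporary t of length K instead of an I*K matrix. */
    double *t = malloc((size_t)K * sizeof(double));
    memset(A, 0, (size_t)I * L * sizeof(double));
    for (int i = 0; i < I; i++) {
        /* distributed kernel 1: t(k) = sum_j B(i,j) * C(j,k) */
        memset(t, 0, (size_t)K * sizeof(double));
        for (int j = 0; j < J; j++)
            for (int k = 0; k < K; k++)
                t[k] += B[i*J + j] * C[j*K + k];
        /* distributed kernel 2: A(i,l) = sum_k t(k) * D(k,l) */
        for (int k = 0; k < K; k++)
            for (int l = 0; l < L; l++)
                A[i*L + l] += t[k] * D[k*L + l];
    }
    free(t);
}
```

Note that the two inner kernels stay distributed (each is its own loop nest), while only the outer i dimension is fused; this is the shape of the trade-off the branched iteration graph is designed to express.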