We present Cyclotron, a framework and compiler for using recurrence equations to express streaming dataflow algorithms, which then get portably compiled to distributed topologies of interlinked processors. Our framework provides an input language of recurrences over logical tensors, which then gets lowered into an intermediate language of recurrences over logical iteration spaces, and finally into programs of send, receive, and computation operations specific to each individual processor. In Cyclotron's IR, programs are optimized such that external memory interactions are confined to the boundaries of the iteration space. Within inner iteration spaces, all data accesses become local: data accesses target values residing in local fast memory or on neighboring processing units, avoiding costly memory movement. We provide a scheduling language allowing users to define how data gets streamed and broadcasted between processors, enabling pipelined execution of computation kernels over distributed topologies of processing elements. We demonstrate the portability of our approach by compiling our IR to a reconfigurable simulator of systolic arrays and chiplet style distributed hardware, as well as to distributed-memory CPU clusters. In the simulated reconfigurable setting, we use our compiler for hardware design space exploration in which link costs and latencies can be specified. In the distributed CPU setting, we show how to use recurrences and our scheduling language to express various matrix multiplication routines (Cannon, SUMMA, PUMMA, weight stationary) and solvers (Triangular solve and Cholesky). For matrix multiplication and the triangular solve, we generate distributed implementations competitive with ScaLAPACK.
翻译:本文提出Cyclotron,一个基于递推方程表达流式数据流算法并实现可移植编译至互联处理器分布式拓扑的框架与编译器。该框架提供基于逻辑张量的递推式输入语言,通过逐级降阶转换为基于逻辑迭代空间的中间语言,最终生成针对各处理器的发送、接收与计算操作程序。在Cyclotron中间表示中,程序经优化使得外部存储器交互被限制在迭代空间边界。在内部迭代空间中,所有数据访问均实现本地化:数据访问目标为本地快速存储器或相邻处理单元中的数值,从而避免高成本的数据迁移。我们提供调度语言供用户定义处理器间的数据流传输与广播机制,支持在分布式处理单元拓扑上实现计算内核的流水线执行。通过将中间表示编译至可重构脉动阵列模拟器、小芯片式分布式硬件以及分布式内存CPU集群,我们验证了该方法的可移植性。在可重构模拟环境中,我们利用编译器进行硬件设计空间探索,支持指定链路成本与延迟参数。在分布式CPU环境中,我们展示了如何通过递推式与调度语言实现多种矩阵乘法算法(Cannon、SUMMA、PUMMA、权重驻留式)及求解器(三角求解与Cholesky分解)。针对矩阵乘法与三角求解问题,我们生成的分布式实现方案性能与ScaLAPACK库相当。