Spatial dataflow accelerators are a promising direction for next-generation computer systems because they can reduce the memory bottlenecks of traditional von Neumann machines such as CPUs and GPUs. They do so by organizing computation around explicit, compiler-managed data movement over the on-chip network, allowing operands to be forwarded directly between processing elements and reducing reliance on high-latency, bandwidth-limited global shared memory. Such localized communication can deliver higher throughput and efficiency than repeated off-chip memory accesses. However, end-to-end performance depends strongly on how workloads are mapped to the hardware: naive mappings can perform very poorly, so most users rely on hand-tuned vendor libraries. In practice, although existing spatial dataflow accelerators have strong potential for high performance, energy efficiency, and cost efficiency, their limited programmability remains a major barrier to wider adoption. This paper presents TL, an end-to-end framework that compiles tile-based programs (such as Triton kernels) onto spatial dataflow architectures. Unlike most existing compiler frameworks, which focus on optimizing code generation within a single tile, TL addresses the central challenge of distributing tile instances across spatially distributed cores and exploiting the on-chip network and distributed memories to increase data reuse and reduce communication. TL proposes a hardware representation that captures interconnect topology, memory hierarchy, and compute capabilities, enabling both specialized architecture-specific optimizations and support for diverse spatial dataflow targets. TL is built on the MLIR ecosystem and defines a generic entry point for different front-ends and a generic exit point for different back-ends.
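To make the input format concrete, the sketch below is a minimal Triton tile program: a standard vector-add kernel written for illustration, not an example taken from this paper, and the names (`add_kernel`, `BLOCK_SIZE`) are ours. Each program instance processes one tile of the data; on a GPU these instances are scheduled as thread blocks communicating through global memory, whereas a spatial dataflow compiler such as TL must instead place tile instances onto distributed cores and route operands over the on-chip network. (Note that the conventional `tl` alias below refers to Triton's language module, unrelated to the TL framework.)

```python
import torch
import triton
import triton.language as tl  # Triton's language module; not the TL framework


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance (tile instance) handles one BLOCK_SIZE-wide tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


# Launch: the grid enumerates tile instances. Deciding where each instance
# runs, and how its operands move between cores and local memories, is the
# mapping problem the abstract describes.
x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=256)
```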