We have implemented fast Fourier transforms for one-, two-, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes an $n^3$ problem with up to $n^2$ PEs. At this maximum level of parallelism, each PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs FFTs along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution requires an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. This high efficiency on fine-grain communication allows wsFFT to achieve unprecedented levels of parallelism and performance. We analyse in detail computation and communication time, as well as weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for a 3D FFT of a $512^3$ complex input array using a $512 \times 512$ subgrid of the on-wafer PEs. This is the largest ever parallelization for this problem size and the first implementation that breaks the millisecond barrier.
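The superstep-and-transpose structure described above can be illustrated with a minimal single-process NumPy sketch. It is not the wsFFT implementation and does not model the WSE mesh: the function names `pencil_fft3d` and `elements_per_pair` are hypothetical, the transpose stands in for the all-to-all redistribution, and the message-size helper assumes an even block distribution over a $p \times p$ PE grid with $p$ dividing $n$.

```python
import numpy as np


def pencil_fft3d(x):
    """3D FFT as three supersteps of 1D pencil FFTs separated by transposes.

    Illustrative only: a[i, j, :] plays the role of the pencil held by a
    hypothetical PE (i, j); the transpose stands in for the redistribution.
    """
    a = np.asarray(x, dtype=complex)
    for _ in range(3):
        # Superstep: transform every pencil along the locally held (last) axis.
        a = np.fft.fft(a, axis=-1)
        # Redistribution: rotate the axes so that a not-yet-transformed axis
        # becomes the last axis; three rotations restore the original order.
        a = np.transpose(a, (2, 0, 1))
    return a


def elements_per_pair(n, p):
    """Elements one PE sends to one peer in a single redistribution.

    Assumes (hypothetically) an n^3 array spread evenly over a p x p grid:
    each PE holds n^3 / p^2 elements and splits them evenly among the p PEs
    of its row or column, giving (n / p)^3 elements per pair.
    """
    if n % p != 0:
        raise ValueError("this sketch assumes p divides n")
    return (n // p) ** 3


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 8, 8)) + 1j * rng.standard_normal((8, 8, 8))
    print(np.allclose(pencil_fft3d(x), np.fft.fftn(x)))
    print(elements_per_pair(512, 512))
```

Under these assumptions, setting $p = n$ gives one element per pair of PEs, which is the fine-grain regime the abstract refers to.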