The 3D Discrete Fourier Transform (DFT) is a technique used to solve problems in disparate fields. Nowadays, the commonly adopted implementation of the 3D-DFT is derived from the Fast Fourier Transform (FFT) algorithm. However, evidence indicates that the distributed memory 3D-FFT algorithm does not scale well due to its use of all-to-all communication. Here, building on the work of Sedukhin \textit{et al}. [Proceedings of the 30th International Conference on Computers and Their Applications, CATA 2015 pp. 193-200 (01 2015)], we revisit the possibility of improving the scaling of the 3D-DFT by using an alternative approach that uses point-to-point communication, albeit at a higher arithmetic complexity. The new algorithm exploits tensor-matrix multiplications on a volumetrically decomposed domain via three specially adapted variants of Cannon's algorithm. It has here been implemented as a C++ library called S3DFT and tested on the JUWELS Cluster at the J\"ulich Supercomputing Center. Our implementation of the shared memory tensor-matrix multiplication attained 88\% of the theoretical single node peak performance. One variant of the distributed memory tensor-matrix multiplication shows excellent scaling, while the other two show poorer performance, which can be attributed to their intrinsic communication patterns. A comparison of S3DFT with the Intel MKL and FFTW3 libraries indicates that currently iMKL performs best overall, followed in order by FFTW3 and S3DFT. This picture might change with further improvements of the algorithm and/or when running on clusters that use network connections with higher latency, e.g. on cloud platforms.
翻译:3D离散傅里叶变换(DFT)是解决不同领域问题的技术。现今,广泛采用的3D-DFT实现源自快速傅里叶变换(FFT)算法。然而,证据表明分布式内存3D-FFT算法由于全部对全部通信的使用而无法良好扩展。在此,我们在Sedukhin等人的工作[Proceedings of the 30th International Conference on Computers and Their Applications, CATA 2015 pp. 193-200 (01 2015)]的基础上,重新考虑通过使用点对点通信的另一种方法来提高3D-DFT的扩展性,尽管算术复杂度更高。新算法通过三个经过特别适应的Cannon算法变体,在分解的体积域上利用张量-矩阵乘法。它已经作为一个名为S3DFT的C ++库实现,并在Jülich超级计算中心的JUWELS Cluster上进行了测试。我们实现的共享内存张量-矩阵乘法达到了理论单节点峰值性能的88%。一个分布式内存张量-矩阵乘法的变体展现出极佳的扩展性,而其他两个的性能较差,这可以归因于它们固有的通信模式。将S3DFT与Intel MKL和FFTW3库进行比较表明,目前iMKL在整体上表现最好,其次是FFTW3和S3DFT。随着算法的进一步改进和/或在使用延迟更高的网络连接的群集上运行(例如在云平台上),情况可能会发生变化。