Nonuniform fast Fourier transforms dominate the computational cost in many applications, including image reconstruction and signal processing. We thus present a general-purpose GPU-based CUDA library for type 1 (nonuniform to uniform) and type 2 (uniform to nonuniform) transforms in dimensions 2 and 3, in single or double precision. It achieves high performance for a given user-requested accuracy, regardless of the distribution of nonuniform points, via cache-aware point reordering and load-balanced blocked spreading in shared memory. At low accuracies, this gives on-GPU throughputs around $10^9$ nonuniform points per second, and (even including host-device transfer) is typically 4-10$\times$ faster than the latest parallel CPU code FINUFFT (at 28 threads). It is competitive with two established GPU codes, being up to 90$\times$ faster at high accuracy and/or for type 1 transforms with clustered point distributions. Finally, we demonstrate a 6-18$\times$ speedup versus CPU in an X-ray diffraction 3D iterative reconstruction task at $10^{-12}$ accuracy, observing excellent multi-GPU weak scaling up to one rank per GPU.
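For readers new to the terminology, the two transform types can be pinned down by direct summation. The sketch below is not the library's algorithm or its API: it is a slow reference evaluation in plain Python, costing on the order of $M N_1 N_2$ operations for $M$ points and $N_1 \times N_2$ modes. It assumes one common convention (integer modes centred on zero, points given as angles in radians, and a user-chosen exponent sign `isign`); conventions differ between libraries. All function names here are hypothetical.

```python
import cmath


def mode_range(n):
    """Integer frequencies -(n//2), ..., (n-1)//2: n modes centred on zero (assumed ordering)."""
    return range(-(n // 2), (n - 1) // 2 + 1)


def nufft2d_type1_direct(x, y, c, n1, n2, isign=+1):
    """Type 1 (nonuniform to uniform), by direct summation.

    f[k1][k2] = sum_j c[j] * exp(isign * 1j * (k1 * x[j] + k2 * y[j]))
    """
    return [[sum(cj * cmath.exp(isign * 1j * (k1 * xj + k2 * yj))
                 for xj, yj, cj in zip(x, y, c))
             for k2 in mode_range(n2)]
            for k1 in mode_range(n1)]


def nufft2d_type2_direct(x, y, f, isign=+1):
    """Type 2 (uniform to nonuniform), by direct summation.

    c[j] = sum_{k1,k2} f[k1][k2] * exp(isign * 1j * (k1 * x[j] + k2 * y[j]))
    """
    n1, n2 = len(f), len(f[0])
    return [sum(f[i1][i2] * cmath.exp(isign * 1j * (k1 * xj + k2 * yj))
                for i1, k1 in enumerate(mode_range(n1))
                for i2, k2 in enumerate(mode_range(n2)))
            for xj, yj in zip(x, y)]


if __name__ == "__main__":
    # Three arbitrary nonuniform points with complex strengths.
    x = [0.3, -1.2, 2.5]
    y = [1.1, 0.4, -2.0]
    c = [1 + 2j, -0.5j, 0.7]
    f = nufft2d_type1_direct(x, y, c, n1=4, n2=3)       # 4 x 3 grid of Fourier modes
    c_back = nufft2d_type2_direct(x, y, f, isign=-1)    # adjoint of type 1, not its inverse
    print(len(f), len(f[0]), len(c_back))
```

Under this convention, type 2 with the opposite sign is the adjoint of type 1, not its inverse, which is why iterative reconstruction typically applies the two as a pair. A fast NUFFT approximates these same sums to a requested tolerance instead of evaluating them term by term.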