Kolmogorov-Arnold Networks (KANs) promise higher expressive capability and stronger interpretability than Multi-Layer Perceptrons (MLPs), particularly in the domain of AI for Science. However, their practical adoption has been hindered by the low GPU utilization of existing parallel implementations. To address this challenge, we present PolyKAN, a GPU-accelerated operator library that is the first general open-source implementation of KANs and their variants. PolyKAN fuses the forward and backward passes of polynomial KAN layers into a concise set of optimized CUDA kernels. Four orthogonal techniques underpin the design: (i) a \emph{lookup table} with linear interpolation that replaces expensive runtime math-library calls; (ii) \emph{2D tiling} that exposes thread-level parallelism while preserving memory locality; (iii) a \emph{two-stage reduction} scheme that converts scattered atomic updates into a single controllable merge step; and (iv) \emph{coefficient-layout reordering} that yields unit-stride reads under the tiled schedule. Using Chebyshev KAN, a representative KAN variant, as a case study, PolyKAN delivers $1.2$--$10\times$ faster inference and $1.4$--$12\times$ faster training than a Triton + cuBLAS baseline, with identical accuracy on speech, audio-enhancement, and tabular-regression workloads, on both high-end and consumer-grade GPUs.
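To make technique (i) concrete, the following minimal CUDA sketch shows how a lookup table with linear interpolation can stand in for an expensive math-library call. This is a hypothetical illustration under our own assumptions, not PolyKAN's actual kernel: the names (\texttt{lut\_interp}, \texttt{LUT\_SIZE}) and the use of \texttt{tanhf} as the tabulated function are ours; any costly basis function could be tabulated the same way.

\begin{verbatim}
// lut_interp.cu -- illustrative sketch of technique (i): a lookup table
// with linear interpolation replacing an expensive math-library call.
// All names are hypothetical, not PolyKAN's API.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define LUT_SIZE 1024
#define X_MIN   (-4.0f)
#define X_MAX   ( 4.0f)

__constant__ float d_lut[LUT_SIZE];  // sampled on host, cached on chip

// Approximate f(x) by interpolating between the two nearest table entries.
__device__ float lut_interp(float x) {
    float t = (x - X_MIN) * (LUT_SIZE - 1) / (X_MAX - X_MIN);
    t = fminf(fmaxf(t, 0.0f), (float)(LUT_SIZE - 1));  // clamp to range
    int   i    = min((int)t, LUT_SIZE - 2);
    float frac = t - (float)i;
    return fmaf(frac, d_lut[i + 1] - d_lut[i], d_lut[i]);
}

__global__ void apply(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = lut_interp(in[idx]);
}

int main() {
    // Sample tanhf on the host as a stand-in for any costly function.
    float h_lut[LUT_SIZE];
    for (int i = 0; i < LUT_SIZE; ++i) {
        float x = X_MIN + (X_MAX - X_MIN) * i / (LUT_SIZE - 1);
        h_lut[i] = tanhf(x);
    }
    cudaMemcpyToSymbol(d_lut, h_lut, sizeof(h_lut));

    const int n = 4;
    float h_in[n] = {-2.5f, -0.3f, 0.7f, 3.1f}, h_out[n];
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    apply<<<1, 32>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("x=%6.2f lut=%8.5f exact=%8.5f\n",
               h_in[i], h_out[i], tanhf(h_in[i]));
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
\end{verbatim}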
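Technique (iii) can likewise be sketched as two kernels: each block first accumulates partial sums with atomics confined to fast shared memory, and a second kernel performs one controllable merge over the per-block partials, eliminating scattered global atomics. Again, this is a hypothetical sketch under our own naming (\texttt{partial\_sums}, \texttt{merge}), not PolyKAN's implementation; the sample-to-coefficient mapping is a toy placeholder.

\begin{verbatim}
// two_stage_reduction.cu -- illustrative sketch of technique (iii):
// per-block partial sums, then a single merge step, instead of every
// thread issuing atomicAdd into one global gradient buffer.
// All names are hypothetical, not PolyKAN's API.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(float* p, float v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;
}

// Stage 1: each block writes partial sums to its own slice; atomics
// stay in shared memory, and no block contends with another.
__global__ void partial_sums(const float* contrib, float* partials,
                             int n, int n_coeff) {
    extern __shared__ float s[];              // one slot per coefficient
    for (int c = threadIdx.x; c < n_coeff; c += blockDim.x) s[c] = 0.0f;
    __syncthreads();
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int c = idx % n_coeff;                // toy sample->coefficient map
        atomicAdd(&s[c], contrib[idx]);
    }
    __syncthreads();
    for (int c = threadIdx.x; c < n_coeff; c += blockDim.x)
        partials[blockIdx.x * n_coeff + c] = s[c];
}

// Stage 2: one controllable merge over the per-block partials.
__global__ void merge(const float* partials, float* grad,
                      int n_blocks, int n_coeff) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= n_coeff) return;
    float acc = 0.0f;
    for (int b = 0; b < n_blocks; ++b) acc += partials[b * n_coeff + c];
    grad[c] = acc;                            // single write, no atomics
}

int main() {
    const int n = 1 << 16, n_coeff = 8;
    const int threads = 256, blocks = (n + threads - 1) / threads;
    float *contrib, *partials, *grad;
    cudaMalloc(&contrib, n * sizeof(float));
    cudaMalloc(&partials, (size_t)blocks * n_coeff * sizeof(float));
    cudaMalloc(&grad, n_coeff * sizeof(float));
    fill<<<blocks, threads>>>(contrib, 1.0f, n);
    partial_sums<<<blocks, threads, n_coeff * sizeof(float)>>>(
        contrib, partials, n, n_coeff);
    merge<<<1, n_coeff>>>(partials, grad, blocks, n_coeff);
    float h_grad[n_coeff];
    cudaMemcpy(h_grad, grad, sizeof(h_grad), cudaMemcpyDeviceToHost);
    for (int c = 0; c < n_coeff; ++c)
        printf("grad[%d] = %.0f (expect %d)\n", c, h_grad[c], n / n_coeff);
    cudaFree(contrib); cudaFree(partials); cudaFree(grad);
    return 0;
}
\end{verbatim}

The merge stage is the "single controllable merge step" of the abstract: its launch shape, precision, and ordering can be chosen independently of the hot accumulation loop.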