Conventional self-attention mechanisms incur quadratic complexity, limiting their scalability to long sequences. We introduce \textbf{FFTNet}, an adaptive spectral filtering framework that leverages the Fast Fourier Transform (FFT) to achieve global token mixing in $\mathcal{O}(n\log n)$ time. By transforming inputs into the frequency domain, FFTNet exploits the orthogonality and energy preservation guaranteed by Parseval's theorem to capture long-range dependencies efficiently. Our main theoretical contributions are 1) an adaptive spectral filter, 2) a hybrid design that combines local windowing with a global FFT branch, and 3) the introduction of rich nonlinearities in both the frequency and token domains. Experiments on the Long Range Arena and ImageNet benchmarks validate our theoretical insights and demonstrate superior performance over fixed-Fourier and standard attention models.
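As a rough illustration (not the authors' implementation), the following NumPy sketch shows the kind of $\mathcal{O}(n\log n)$ token mixing the abstract describes: an FFT along the sequence axis, a learnable spectral reweighting, nonlinearities in both domains, and an inverse FFT. All names (\texttt{spectral\_mixing}, \texttt{filt}, \texttt{alpha}) and the specific nonlinearities are hypothetical choices for exposition.

\begin{verbatim}
import numpy as np

def spectral_mixing(x, filt, alpha=0.5):
    """Hypothetical sketch of FFT-based global token mixing.

    x:    (seq_len, d_model) real-valued token embeddings
    filt: (seq_len // 2 + 1, d_model) learnable spectral gains
          (an illustrative stand-in for the adaptive filter)
    """
    # Transform along the sequence axis: O(n log n) per feature dim.
    X = np.fft.rfft(x, axis=0)                 # complex, (n//2+1, d_model)
    # Adaptive spectral filtering: reweight frequency components.
    X_f = X * filt
    # A simple frequency-domain nonlinearity (illustrative choice).
    X_f = X_f * np.tanh(np.abs(X_f))
    # Back to the token domain; Parseval's theorem ties the two norms.
    y = np.fft.irfft(X_f, n=x.shape[0], axis=0)
    # Token-domain nonlinearity plus residual connection (illustrative).
    return x + alpha * np.maximum(y, 0.0)

# Usage: mix a toy sequence of 1024 tokens with 64 features.
rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 64))
filt = 1.0 + 0.1 * rng.standard_normal((1024 // 2 + 1, 64))
print(spectral_mixing(x, filt).shape)  # (1024, 64)
\end{verbatim}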