As Large Language Models (LLMs) scale to longer context windows, the computational cost of attention mechanisms, which traditionally grows quadratically with input length, presents a critical challenge for real-time and memory-constrained deployments. Existing sparse attention techniques have sought to reduce this complexity, but they often incur significant overhead or compromise accuracy, making them less practical for large contexts on mid-range hardware. In this paper, we introduce SparseAccelerate, a dynamic sparse attention method that adapts its sparsity patterns based on input characteristics, effectively flattening the attention complexity curve. Our approach is effective for input lengths starting at 16K tokens and scales efficiently up to 128K tokens on dual NVIDIA A5000 GPUs (24GB each). Experimental results show that SparseAccelerate achieves up to a 1.04x reduction in Time-To-First-Token (TTFT) latency at 32K tokens, while also providing substantial memory savings. These improvements yield practical gains for memory-intensive applications and long-context tasks that were previously infeasible with standard attention. Beyond latency reductions, SparseAccelerate fundamentally shifts the scaling trend, demonstrating the smallest TTFT growth gradient relative to context length among competing methods. Ongoing evaluations on diverse benchmarks confirm its scalability, positioning SparseAccelerate as a critical advancement toward efficient, real-time, and large-context LLM inference on accessible hardware.
翻译:暂无翻译