The growing size of large language models has created significant computational inefficiencies. To address this challenge, sparse activation methods selectively deactivate non-essential parameters during inference, reducing the computational cost of FFNN layers. While existing methods focus on non-linear gating mechanisms, we hypothesize that the sparsity of the FFNN layer manifests globally as a linear combination over its internal down projection matrix. Based on this insight, we propose two methods: M-COUNTDOWN, which leverages indirect coefficients, and D-COUNTDOWN, which utilizes the direct coefficients of the linear combination. Experimental results demonstrate that D-COUNTDOWN can omit 90% of computations with performance loss as low as 5.5% in the ideal case, while M-COUNTDOWN provides a predictor-free solution with up to 29.4% better performance preservation than existing methods. Our specialized kernel implementations translate these theoretical gains into substantial real-world acceleration.
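To make the linear-combination view concrete, the sketch below (ours, not the paper's code) writes a gated FFNN output as W_down applied to hidden coefficients, i.e. a weighted combination of W_down's columns, and then drops the columns whose coefficients have the smallest magnitude. The layer sizes, the SiLU activation, and the top-k selection rule are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the linear-combination view of a gated FFNN and of
# column-level sparsification over the down projection. Illustrative only;
# layer sizes, activation, and the top-k rule are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 64, 256

W_gate = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_model)
W_up   = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_model)
W_down = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_ff)

def silu(z):
    return z / (1.0 + np.exp(-z))

def ffnn_dense(x):
    # Gated FFNN: the hidden vector c acts as the coefficients of a
    # linear combination over the columns of W_down.
    c = silu(W_gate @ x) * (W_up @ x)
    return W_down @ c

def ffnn_sparse(x, keep_ratio=0.1):
    # Keep only the largest-magnitude coefficients and the corresponding
    # columns of W_down (hypothetical top-k selection rule).
    c = silu(W_gate @ x) * (W_up @ x)
    k = max(1, int(keep_ratio * d_ff))
    idx = np.argpartition(np.abs(c), -k)[-k:]
    return W_down[:, idx] @ c[idx]

x = rng.standard_normal(d_model)
dense, sparse = ffnn_dense(x), ffnn_sparse(x, keep_ratio=0.1)
rel_err = np.linalg.norm(dense - sparse) / np.linalg.norm(dense)
print(f"relative error with 10% of columns kept: {rel_err:.3f}")
```

Note that this naive sketch still materializes the full coefficient vector before selecting columns, so it saves no computation by itself; realizing actual savings requires choosing the active indices beforehand, e.g. via a lightweight predictor or via cheaper indirect coefficients, as the abstract's description of D-COUNTDOWN and M-COUNTDOWN suggests.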