Transformers have become the backbone of modern AI, yet their high computational demands pose critical system challenges. While sparse training offers efficiency gains, existing methods fail to preserve the structural relationships between weight matrices that interact multiplicatively in attention and feed-forward layers, an oversight that degrades performance at high sparsity levels. We introduce EcoSpa, an efficient structured sparse training method that jointly evaluates and sparsifies coupled weight matrix pairs, preserving their interaction patterns through aligned row/column removal. EcoSpa introduces a new granularity for calibrating the importance of structural components and performs coupled estimation and sparsification in both pre-training and fine-tuning scenarios. Evaluations demonstrate substantial improvements: EcoSpa trains LLaMA-1B with 50\% less memory and 21\% faster training, compresses GPT-2-Medium by $2.2\times$ while lowering perplexity by $2.4$, and delivers a $1.6\times$ inference speedup. The approach uses only standard PyTorch operations, requiring no custom hardware or kernels, making efficient transformer training accessible on commodity hardware.
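To make the notion of "aligned row/column removal" on a coupled weight pair concrete, the following is a minimal illustrative sketch, not the authors' implementation: it prunes a toy feed-forward pair $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ and $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ by scoring each shared hidden unit jointly and dropping the same indices from $W_1$'s columns and $W_2$'s rows. The joint L2-norm score and the function name `prune_coupled_pair` are placeholders, not EcoSpa's actual calibration metric or API.

```python
# Illustrative sketch of aligned structured pruning on a coupled weight pair.
# Not the EcoSpa implementation; the joint-norm score is a placeholder metric.
import torch

def prune_coupled_pair(W1: torch.Tensor, W2: torch.Tensor, sparsity: float):
    """Remove the same hidden units from both matrices of a coupled pair.

    W1: (d_model, d_ff) -- its columns feed the shared hidden units.
    W2: (d_ff, d_model) -- its rows consume the same hidden units.
    """
    d_ff = W1.shape[1]
    assert W2.shape[0] == d_ff, "coupled dimensions must match"

    # Joint importance: evaluate each hidden unit using both matrices at once,
    # so the pair is scored together rather than independently.
    score = W1.norm(dim=0) * W2.norm(dim=1)          # shape: (d_ff,)

    keep = int(d_ff * (1.0 - sparsity))
    kept_idx = torch.topk(score, keep).indices.sort().values

    # Aligned removal: drop identical indices from W1's columns and W2's rows,
    # so the product W1[:, kept] @ W2[kept, :] remains well defined.
    return W1[:, kept_idx], W2[kept_idx, :]

# Example: prune 50% of the hidden units of a toy FFN pair.
W1 = torch.randn(512, 2048)
W2 = torch.randn(2048, 512)
W1_s, W2_s = prune_coupled_pair(W1, W2, sparsity=0.5)
print(W1_s.shape, W2_s.shape)  # torch.Size([512, 1024]) torch.Size([1024, 512])
```

Because the removal is aligned, the pruned pair composes exactly as before but over a smaller shared dimension, which uses only standard PyTorch indexing and requires no custom kernels.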