Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer admits any of the $\binom{n}{w}$ masks obtained by choosing $w$ active weights out of $n$, a fixed block or N:M layout can reach only a subset of those possibilities. We propose to close this gap by learning, for each layer, a single permutation matrix jointly with the structured weight matrix. Applied to three canonical structures -- block, N:M, and diagonal -- permutation-augmented DST (PA-DST) matches unstructured baselines (RigL, SET) at 90--95\% sparsity on ImageNet-1K (ViT-B/16) and WikiText-103 (GPT-2), while training up to $1.21\times$ faster and running inference up to $2.9\times$ faster. These results position structured sparsity with a learned permutation as a sweet spot between accuracy and efficiency.
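To make the core idea concrete, the following is a minimal PyTorch sketch of a layer that composes a per-layer input permutation with an N:M-masked weight, as the abstract describes. The class name, the 2:4 default, the hard index-buffer permutation, and the mention of a Sinkhorn-style relaxation are illustrative assumptions, not the paper's actual implementation; in particular, learning the permutation jointly with the weights would require a differentiable relaxation that this sketch omits.

\begin{verbatim}
import torch
import torch.nn as nn

class PermutedNMLinear(nn.Module):
    """Hypothetical sketch: a linear layer whose weight obeys an N:M mask,
    composed with a per-layer permutation of the input features. The
    permutation is stored as a hard index buffer; a differentiable
    relaxation (e.g. Sinkhorn) would be needed to learn it jointly with
    the structured weights, which is not shown here."""

    def __init__(self, in_features, out_features, n=2, m=4):
        super().__init__()
        assert in_features % m == 0
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.n, self.m = n, m
        # Assumed per-layer permutation of input columns (identity at init).
        self.register_buffer("perm", torch.arange(in_features))

    def nm_mask(self):
        # Keep the n largest-magnitude weights in every group of m columns.
        w = self.weight.view(self.weight.shape[0], -1, self.m)
        idx = w.abs().topk(self.n, dim=-1).indices
        mask = torch.zeros_like(w).scatter_(-1, idx, 1.0)
        return mask.view_as(self.weight)

    def forward(self, x):
        # Permute input features, then apply the N:M-masked weight.
        x = x[..., self.perm]
        return nn.functional.linear(x, self.weight * self.nm_mask())

layer = PermutedNMLinear(8, 4)
print(layer(torch.randn(3, 8)).shape)  # torch.Size([3, 4])
\end{verbatim}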