We present Spartan, a method for training sparse neural network models with a predetermined level of sparsity. Spartan is based on a combination of two techniques: (1) soft top-k masking of low-magnitude parameters via a regularized optimal transportation problem and (2) dual averaging-based parameter updates with hard sparsification in the forward pass. This scheme realizes an exploration-exploitation tradeoff: early in training, the learner is able to explore various sparsity patterns, and as the soft top-k approximation is gradually sharpened over the course of training, the balance shifts towards parameter optimization with respect to a fixed sparsity mask. Spartan is sufficiently flexible to accommodate a variety of sparsity allocation policies, including both unstructured and block structured sparsity, as well as general cost-sensitive sparsity allocation mediated by linear models of per-parameter costs. On ImageNet-1K classification, Spartan yields 95% sparse ResNet-50 models and 90% block sparse ViT-B/16 models while incurring absolute top-1 accuracy losses of less than 1% compared to fully dense training.
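To make the two ingredients concrete, the following is a minimal PyTorch sketch, not the authors' reference implementation: the function names `soft_topk_mask` and `spartan_weight`, the bisection search, and the straight-through surrogate for the dual averaging update are all illustrative assumptions. The entropy-regularized OT relaxation of top-k selection yields a mask of the form sigmoid((score - tau) / eps) whose entries sum to k; here tau is found by bisection, which shares a fixed point with Sinkhorn-style dual updates.

```python
import torch

def soft_topk_mask(scores: torch.Tensor, k: int, eps: float,
                   n_iter: int = 50) -> torch.Tensor:
    """Soft top-k mask from an entropy-regularized OT problem.

    The 2-column entropic OT relaxation of top-k selection has row
    marginals sigmoid((scores - tau) / eps), with tau chosen so the
    mask sums to k. tau is found by bisection (illustrative stand-in
    for Sinkhorn-style dual updates).
    """
    lo = (scores.min() - 10.0 * eps).item()
    hi = (scores.max() + 10.0 * eps).item()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = torch.sigmoid((scores - tau) / eps).sum().item()
        if mass > k:
            lo = tau  # mask too dense: raise the threshold
        else:
            hi = tau  # mask too sparse: lower the threshold
    return torch.sigmoid((scores - 0.5 * (lo + hi)) / eps)

def spartan_weight(w: torch.Tensor, k: int, eps: float) -> torch.Tensor:
    """Hard-sparsified weights forward, soft-masked gradients backward.

    Forward: exact magnitude top-k (a k-sparse weight tensor).
    Backward: gradients flow through the soft top-k mask, so
    low-magnitude parameters still receive attenuated updates and the
    sparsity pattern can be explored early in training. This
    straight-through construction is a simplified surrogate for the
    paper's dual averaging update.
    """
    v = w.abs().flatten()
    m_soft = soft_topk_mask(v, k, eps).view_as(w)
    m_hard = torch.zeros_like(v)
    m_hard[torch.topk(v, k).indices] = 1.0
    m_hard = m_hard.view_as(w)
    w_soft = w * m_soft
    # Value of w * m_hard in the forward pass, gradient of w * m_soft.
    return w_soft + (w * m_hard - w_soft).detach()
```

In training, the temperature eps would be annealed toward zero, sharpening the soft mask until it coincides with the hard forward-pass mask and optimization proceeds under an effectively fixed sparsity pattern, matching the exploration-exploitation schedule described above.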