Pruning large neural networks to create high-quality, independently trainable sparse masks, which can maintain similar performance to their dense counterparts, is very desirable due to the reduced space and time complexity. While research effort has focused on increasingly sophisticated pruning methods that lead to sparse subnetworks trainable from scratch, we argue for an orthogonal, under-explored theme: improving training techniques for pruned sub-networks, i.e., sparse training. In contrast to the popular belief that only the quality of the sparse mask matters for sparse training, in this paper we demonstrate an alternative opportunity: one can carefully customize the sparse training techniques to deviate from the default dense network training protocols, by introducing ``ghost'' neurons and skip connections at the early stage of training, and strategically modifying the initialization as well as the labels. Our new sparse training recipe is generally applicable to improving training from scratch with various sparse masks. By adopting our newly curated techniques, we demonstrate significant performance gains across various popular datasets (CIFAR-10, CIFAR-100, TinyImageNet), architectures (ResNet-18/32/104, Vgg16, MobileNet), and sparse mask options (lottery ticket, SNIP/GRASP, SynFlow, or even random pruning), compared to the default training protocols, especially at high sparsity levels. Code is at https://github.com/VITA-Group/ToST
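To make the flavor of the recipe concrete, the following is a minimal, illustrative sketch (not the authors' exact implementation) of two of the ingredients named above: a ``ghost'' skip connection whose strength is annealed to zero over the early epochs, and softened labels. All names here (GhostSkipBlock, ghost_alpha, smooth_labels) and the specific annealing schedule are hypothetical, written in PyTorch for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GhostSkipBlock(nn.Module):
    """Conv block with an auxiliary ("ghost") identity shortcut.

    The shortcut is scaled by `alpha`, which the training loop anneals from
    1.0 to 0.0 over the first few epochs, so the extra connection only aids
    early optimization of the sparse subnetwork and vanishes afterwards.
    (Hypothetical sketch; the paper's actual schedule may differ.)
    """

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.alpha = 1.0  # ghost-connection strength, set externally per epoch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn(self.conv(x)))
        return out + self.alpha * x  # ghost skip fades out as alpha -> 0


def ghost_alpha(epoch: int, ghost_epochs: int = 10) -> float:
    """Linearly anneal the ghost-connection strength to zero (assumed schedule)."""
    return max(0.0, 1.0 - epoch / ghost_epochs)


def smooth_labels(targets: torch.Tensor, num_classes: int, eps: float = 0.1) -> torch.Tensor:
    """Return soft targets: (1 - eps) on the true class, eps spread uniformly."""
    one_hot = F.one_hot(targets, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes
```

In a training loop built around this sketch, one would set `block.alpha = ghost_alpha(epoch)` for every ghost block at the start of each epoch and train the sparse subnetwork against `smooth_labels(targets, num_classes)` instead of hard labels; the modified-initialization ingredient mentioned in the abstract is omitted here.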