Training a sparse neural network from scratch requires optimizing the connections at the same time as the weights themselves. Typically, the weights are redistributed after a predefined number of weight updates: a fraction of the parameters of each layer is removed and reinserted at different locations within the same layer. The density of each layer is determined by heuristics, often based purely on the size of the parameter tensor. So while the connections within each layer are optimized multiple times during training, the density of each layer remains constant. This leaves great potential unrealized, especially in scenarios with high sparsity of 90% or more. We propose Global Gradient-based Redistribution, a technique that redistributes weights across all layers, adding more weights to the layers that need them most. Our evaluation shows that our approach is less prone to an unbalanced weight distribution at initialization than previous work and that it finds better-performing sparse subnetworks at very high sparsity levels.
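The abstract does not spell out the pruning criterion or the update schedule, so the following PyTorch sketch is only an illustration of the idea, not the paper's exact algorithm. It assumes magnitude-based pruning within each layer, gradients being available for inactive positions at redistribution time, and hypothetical names (redistribute_globally, prune_frac, masks). The point it illustrates is that regrowth is ranked globally across layers, so the freed connection budget flows to the layers with the strongest gradient signal and per-layer densities can drift apart over training.

    import torch


    def redistribute_globally(layers, masks, prune_frac=0.3):
        """One hypothetical prune-and-regrow step with global, gradient-based
        redistribution: the budget freed by pruning is reassigned to whichever
        layers currently show the largest gradient magnitudes, so the
        per-layer density can change over the course of training."""
        total_regrow = 0

        # 1) Pruning (assumed criterion: smallest weight magnitude per layer).
        for layer, mask in zip(layers, masks):
            w = layer.weight.data
            active = mask.bool()
            n_prune = int(prune_frac * active.sum().item())
            if n_prune == 0:
                continue
            scores = torch.where(active, w.abs(), torch.full_like(w, float("inf")))
            drop = torch.topk(scores.view(-1), n_prune, largest=False).indices
            mask.view(-1)[drop] = 0.0
            w.view(-1)[drop] = 0.0
            total_regrow += n_prune

        # 2) Global gradient-based regrowth: rank every inactive position of
        #    every layer by |gradient| and activate the top ones, regardless
        #    of which layer they belong to.
        grads, owners, positions = [], [], []
        for i, (layer, mask) in enumerate(zip(layers, masks)):
            g = layer.weight.grad
            if g is None:
                continue
            inactive = (mask.view(-1) == 0).nonzero(as_tuple=True)[0]
            grads.append(g.view(-1)[inactive].abs())
            owners.append(torch.full_like(inactive, i))
            positions.append(inactive)
        if not grads:
            return
        grads = torch.cat(grads)
        owners = torch.cat(owners)
        positions = torch.cat(positions)
        top = torch.topk(grads, min(total_regrow, grads.numel())).indices
        for i, pos in zip(owners[top].tolist(), positions[top].tolist()):
            masks[i].view(-1)[pos] = 1.0
            layers[i].weight.data.view(-1)[pos] = 0.0  # regrown weights start at zero

In a training loop, such a step would typically be invoked after a predefined number of weight updates, with each mask multiplied into its weight tensor after every optimizer step so that pruned positions stay at zero between redistributions.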