We report, for the first time, on the cascade weight shedding phenomenon in deep neural networks, in which, in response to pruning a small percentage of a network's weights, a large percentage of the remaining weights is shed over a few epochs during the ensuing fine-tuning phase. We show that cascade weight shedding, when present, can significantly improve the performance of an otherwise sub-optimal scheme such as random pruning. This explains why some pruning methods may perform well under certain circumstances but poorly under others, e.g., on ResNet50 vs. MobileNetV3. We provide insight into why global magnitude-based pruning (GMP), despite its simplicity, provides competitive performance across a wide range of scenarios. We also demonstrate cascade weight shedding's potential for improving GMP's accuracy and reducing its computational complexity. In doing so, we highlight the importance of pruning and learning-rate schedules. We shed light on the weight-rewinding and learning-rate-rewinding methods of re-training, showing their possible connection to cascade weight shedding and a reason for their advantage over fine-tuning. We also investigate cascade weight shedding's effect on the set of kept weights and its implications for semi-structured pruning. Finally, we give directions for future research.
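To make the pruning baseline discussed above concrete, the following is a minimal sketch of global magnitude-based pruning (GMP) for a PyTorch model. It is not the paper's exact protocol: the choice to prune only Conv2d and Linear layers, the single one-shot sparsity level, and the function name `global_magnitude_prune` are illustrative assumptions. The key point it illustrates is that the magnitude threshold is computed globally, over all prunable weights at once, rather than per layer.

```python
# A hedged sketch of global magnitude-based pruning (GMP) in PyTorch.
# Assumptions (not from the paper): only Conv2d/Linear weights are prunable,
# and pruning is applied one-shot at a single target sparsity.
import torch
import torch.nn as nn


def global_magnitude_prune(model: nn.Module, sparsity: float) -> dict:
    """Zero out the `sparsity` fraction of smallest-magnitude weights,
    ranked globally across all Conv2d and Linear layers.
    Returns a dict of binary masks keyed by parameter name."""
    prunable = {
        name + ".weight": module.weight
        for name, module in model.named_modules()
        if isinstance(module, (nn.Conv2d, nn.Linear))
    }
    # Global threshold: the k-th smallest absolute weight value over all layers.
    all_weights = torch.cat([w.detach().abs().flatten() for w in prunable.values()])
    k = max(int(sparsity * all_weights.numel()), 1)
    threshold = torch.kthvalue(all_weights, k).values

    masks = {}
    with torch.no_grad():
        for name, w in prunable.items():
            mask = (w.abs() > threshold).float()
            w.mul_(mask)        # shed the low-magnitude weights
            masks[name] = mask  # keep masks to re-apply during fine-tuning
    return masks
```

In a subsequent fine-tuning loop, one would typically re-apply these masks after each optimizer step so the pruned weights stay at zero; tracking the magnitudes of the surviving weights during that phase is where the cascade weight shedding described above would be observed.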