The importance of learning rate (LR) schedules in network pruning has been observed in several recent works. For example, Frankle and Carbin (2019) highlighted that winning tickets (i.e., accuracy-preserving subnetworks) cannot be found without applying an LR warmup schedule, and Renda, Frankle, and Carbin (2020) demonstrated that rewinding the LR to its initial state at the end of each pruning cycle improves performance. In this paper, we go one step further by first providing a theoretical justification for the surprising effect of LR schedules. Next, we propose an LR schedule for network pruning called SILO, which stands for S-shaped Improved Learning rate Optimization. The advantages of SILO over existing state-of-the-art (SOTA) LR schedules are two-fold: (i) SILO has a strong theoretical motivation and dynamically adjusts the LR during pruning to improve generalization. Specifically, SILO increases the LR upper bound (max_lr) in an S-shape. This leads to an improvement of 2% to 4% in extensive experiments with various types of networks (e.g., Vision Transformers, ResNet) on popular datasets such as ImageNet and CIFAR-10/100. (ii) In addition to the strong theoretical motivation, SILO is empirically optimal in the sense of matching an Oracle, which exhaustively searches for the optimal value of max_lr via grid search. We find that SILO is able to precisely adjust the value of max_lr to be within the Oracle-optimized interval, resulting in performance competitive with the Oracle at significantly lower complexity.
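To make the core idea concrete, the sketch below illustrates one plausible way an S-shaped max_lr schedule over pruning cycles could look. The function name silo_max_lr, the logistic parameterization, and the parameters lr_low, lr_high, and steepness are all illustrative assumptions, not the paper's exact formulation:

```python
import math

def silo_max_lr(cycle: int, total_cycles: int,
                lr_low: float = 0.01, lr_high: float = 0.1,
                steepness: float = 8.0) -> float:
    """Illustrative S-shaped schedule for the LR upper bound (max_lr).

    Hypothetical logistic form: max_lr rises slowly in early pruning
    cycles, quickly in the middle, and saturates near lr_high. The
    paper's exact parameterization may differ.
    """
    # Normalized progress through the pruning cycles, in [0, 1].
    progress = cycle / max(total_cycles - 1, 1)
    # Logistic (sigmoid) curve centered at the midpoint of pruning.
    s = 1.0 / (1.0 + math.exp(-steepness * (progress - 0.5)))
    return lr_low + (lr_high - lr_low) * s

# Example: max_lr grows in an S-shape across 10 pruning cycles.
for c in range(10):
    print(f"cycle {c}: max_lr = {silo_max_lr(c, 10):.4f}")
```

Under this assumed form, each pruning cycle would run its usual LR warmup/decay schedule, but with the cycle's upper bound taken from the S-shaped curve rather than held fixed.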